Python – Programming cleanly when writing scientific code

clean codecoding-stylepythonpython-3.x

I don't really write large projects. I'm not maintaining a huge database or dealing with millions of lines of code.

My code is primarily "scripting" type stuff – things to test mathematical functions, or to simulate something – "scientific programming". The longest programs I've worked on up to this point are a couple hundred lines of code, and most of the programs I work on are around 150.

My code is also crap. I realized this the other day as I was trying to find a file I wrote a while ago but that I probably overwrote and that I don't use version control, which is probably making a large number of you cringe in agony at my stupidity.

The style of my code is convoluted and is filled with obsolete comments noting alternate ways to do something or with lines of code copied over. While the variable names always start out very nice and descriptive, as I add or change things as per e.g., something new that someone wants tested, code gets overlayed on top and overwritten and because I feel like this thing should be tested quickly now that I have a framework I start using crappy variable names and the file goes to pot.

In the project I'm working on now, I'm in the phase where all of this is coming back to bite me in a big way. But the problem is (aside from using version control, and making a new file for each new iteration and recording all of it in a text file somewhere, which will probably help the situation dramatically) I don't really know how to proceed with improving my actual coding style.

Is unit testing necessary for writing smaller pieces of code? How about OOP? What sorts of approaches are good for writing good, clean code quickly when doing "scientific programming" as opposed to working on larger projects?

I ask these questions because often, the programming itself isn't super complex. It's more about the math or science that I'm testing or researching with the programming. E.g., is a class necessary when two variables and a function could probably take care of it? (Consider these are also generally situations where the program's speed is preferred to be on the faster end – when you're running 25,000,000+ time steps of a simulation, you kinda want it to be.)

Perhaps this is too broad, and if so, I apologize, but looking at programming books, they often seem to be addressed at larger projects. My code doesn't need OOP, and it's already pretty darn short so it's not like "oh, but the file will be reduced by a thousand lines if we do that!" I want to know how to "start over" and program cleanly on these smaller, faster projects.

I would be glad to provide more specific details, but the more general the advice, the more useful, I think. I am programming in Python 3.


Someone suggested a duplicate. Let me make clear I'm not talking about ignoring standard programming standards outright. Clearly, there's a reason those standards exist. But on the other hand, does it really make sense to write code that is say OOP when some standard stuff could have been done, would have been much faster to write, and would have been a similar level of readability because of the shortness of the program?

There's exceptions. Further, there's probably standards for scientific programming beyond just plain standards. I'm asking about those as well. This isn't about if normal coding standards should be ignored when writing scientific code, it's about writing clean scientific code!


Update

Just thought I'd add a "not-quite-one-week-later" sort of update. All of your advice was extremely helpful. I now am using version control – git, with git kraken for a graphical interface. It's very easy to use, and has cleaned up my files drastically – no more need for old files sticking around, or old versions of code commented out "just in case".

I also installed pylint and ran it on all of my code. One file got a negative score initially; I'm not even sure how that was possible. My main file started at a score of ~1.83/10 and now is at ~9.1/10. All of the code now conforms fairly well to standards. I also ran over it with my own eyes updating variable names that had gone…uhm…awry, and looking for sections to refactor.

In particular, I asked a recent question on this site on refactoring one of my main functions, and it now is a lot cleaner and a lot shorter: instead of a long, bloated, if/else filled function, it is now less than half the size and much easier to figure out what is going on.

My next step is implementing "unit testing" of sorts. By which I mean a file that I can run on my main file which looks at all the functions in it with assert statements and try/excepts, which is probably not the best way of doing it, and results in a lot of duplicate code, but I'll keep reading and try to figure out how to do it better.

I've also updated significantly the documentation I'd already written, and added supplementary files like an excel spreadsheet, the documentation, and an associated paper to the github repository. It kinda looks like a real programming project now.

So…I guess this is all to say: thank you.

Best Answer

This is a pretty common problem for scientists. I've seen it a lot, and it always stems by the fact that programming is something you pick on the side as a tool to do your job.

So your scripts are a mess. I'm going to go against common sense and say that, assuming you're programming alone, this is not so bad! You're never going to touch most of what you write ever again, so spending too much time to write pretty code instead of producing "value" (so the result of your script) isn't going to do much to you.

However, there is going to be a time where you need to go back to something you did and see exactly how something was working. Additionally, if other scientists will need to review your code, it's really important for it to be as clear and concise as possible, so that everyone can understand it.

Your main problem is going to be readability, so here's a few tips for improving:

Variable names:

Scientists love to use concise notations. All mathematical equations usually use single letters as variables, and I wouldn't be surprised to see lots and lots of very short variables in your code. This hurts readability a lot. When you'll go back to your code you're not going to remember what those y, i and x2 represent, and you'll spend a lot of time trying to figure it out. Try instead naming your variables explicitly, using names that represent exactly what they are.

Split your code into functions:

Now that you renamed all your variables, your equations look terrible, and are multiple lines long.

Instead of leaving it in your main program, move that equation to a different function, and name it accordingly. Now instead of having a huge and messed up line of code, you'll have a short instructions telling you exactly what's going on and what equation you used. This improves both your main program, as you don't even have to look at the actual equation to know what you did, and the equation code itself, as in a separate function you can name your variables however you want, and go back to the more familiar single letters.

On this line of thought, try to find out all the pieces of code that represent something, especially if that something is something you have to do multiple times in your code, and split them out into functions. You'll find out that your code will quickly become easier to read, and that you'll be able to use the same functions without writing more code.

Icing on the cake, if those functions are needed in more of your programs you can just make a library for them, and you'll have them always available.

Global variables:

Back when I was a beginner, I thought this was a great way for passing around data I needed in many points of my program. Turns out there are many other ways to pass around stuff, and the only things global variables do is giving people headaches, since if you go to a random point of your program you'll never know when that value was last used or edited, and tracking it down will be a pain. Try to avoid them whenever possible.

If your functions need to return or modify multiple values, either make a class with those values and pass them down as a parameter, or make the function return multiple values (with named tuples) and assign those values in the caller code.

Version Control

This doesn't directly improve readability, but helps you do all the above. Whenever you do some changes, commit to version control (a local Git repository will be fine enough), and if something doesn't work, look at what you changed or just roll back! This will make refactoring your code way easier, and will be a safety net if you accidentally break stuff.

Keeping all this in mind will allow you to write clear and more effective code, and will also help you find possible mistakes faster, as you won't have to wade through gigantic functions and messy variables.

Related Topic