Programmers vs. Scientists on Coding
Just read this great short tidbit about how programmers see software versus how scientists see software.
http://www.johndcook.com/blog/2011/07/21/software-exoskeletons/

It's not something I ever thought about, not having worked with scientists, but it makes total sense. A scientist just wants to get their results. They can open a Python shell, import numpy and scipy, connect to the database, and have the results they need spit out with just a few more lines. That will make them very happy.
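To make that concrete, here's roughly what such a throwaway session looks like. Everything specific below (the database file, the table, the regression) is a hypothetical stand-in, not anything from the linked post:

```python
# A minimal sketch of the "few lines in a Python shell" workflow described above.
import sqlite3
import numpy as np
from scipy import stats

# Hypothetical results database and table, standing in for whatever the lab uses.
conn = sqlite3.connect("experiment.db")
rows = conn.execute("SELECT dose, response FROM trials").fetchall()
dose, response = np.array(rows, dtype=float).T

# A couple more lines and the answer is on the screen. Good enough for the
# scientist; nowhere near a reusable program (no validation, no error handling).
slope, intercept, r, p, stderr = stats.linregress(dose, response)
print(f"slope={slope:.3f}, r^2={r**2:.3f}, p={p:.3g}")
```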
But copying and pasting those lines into a text editor is nowhere close to actually making a piece of software that anyone can reuse. There's only about a thousand things that need to be added to it to make it an actual usable program. Validating the input data alone is going to be a ton of work, depending on the complexity.
It's kinda weird when you think about it. You would think that programmers would be the ones who wouldn't mind having a program so unfinished that it only works when actual programmers operate it by hand. In reality, programmers are the laziest: they want the software to work on its own without any people touching it. Scientists just want the results as fast as possible, regardless of any other factor.
Comments
I actually wrote a Python module for biologists to easily write up some primers/probes (genetic forensic stuff) in a Python interpreter, then run a bunch of calculations over them. They didn't want a spit-and-polished GUI or anything else you'd want for production use. They just wanted to be able to copy genetic sequences, paste them into something, make some modifications, and get results.
Here is an example of a programmer (me) writing reusable code (a module) for scientists (specifically forensics specialists) to do their own "programming" in an interpreter and get results.
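For the curious, the module was essentially a handful of sequence helpers meant to be imported in an interpreter session. This is not the actual code, just a minimal sketch of that kind of module; the function names are made up and the textbook Wallace-rule melting temperature stands in for the real calculations:

```python
# primer_helpers.py (hypothetical name): small functions to be used interactively.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def gc_content(seq):
    """Fraction of G/C bases in a primer or probe sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def melting_temp(seq):
    """Rough melting temperature via the Wallace rule (short oligos only)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc

def reverse_complement(seq):
    """Reverse complement, e.g. for the opposite-strand primer."""
    return seq.upper().translate(COMPLEMENT)[::-1]

# Typical interpreter use: paste a sequence, eyeball the numbers, move on.
# >>> melting_temp("ATCGATCGGC"), gc_content("ATCGATCGGC")
# (32, 0.6)
```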
EDIT: oh yeah I forgot to add the conclusion. After I wrote the software, the scientists decided they would simply order pre-made primer/probe sets instead of building them themselves. The software got used a few times to finish up a paper. If they ever start building primer/probe sets again, my software will have been long forgotten (even if it is still useful for the purpose). They might even ask someone else to write basically the same confirmatory code all over again.
Writing a ton of scripts just to get crappy software to work for your specific problem is a necessity.
The two main reasons why this culture of undocumented code and barely functional programs exists are: 1) There is no academic merit in producing readable, maintainable code. You are not going to get more grant money, and your boss will not be happy when you tell him that you spent three weeks longer than necessary on an analysis because you wanted to produce a better tool. 2) There is a disincentive to publish your tools, because by doing so you make it easier for other people to publish research in your own area, something you'd rather do yourself.
The underlying combining factor here is the way science funding is structured. You don't get funding for producing good tools.
I could do a whole show about this.
The thing is, almost all scientists wind up having to learn specialized software that fits exactly what they're doing. Thus, because we're learning something new anyhow, it doesn't really matter what we're learning. We just need to know that we're learning how to use the tool that gets us what we want.
Let's say you, as a scientist, secure some grant money to do some project - tracking whales or something. Nobody makes the software you need, so you hire a dude and work with him to get him to build you the software that you need.
You hired a guy to make a custom tool. Then, you learn how to use that tool and use the hell out of it to do the research you need to. This is no different than a factory having a specialty piece of equipment fabricated for them.
This scheme doesn't really include time or money to continue to develop the software beyond its initial use. Because what we need is so very specialized and project-dependent, it doesn't make sense to invest the extra time or money to develop the tool more fully - instead, we build what we need when we need it and use it until we don't need it.
A lot of companies pay programmers lots and lots of money to develop software for scientific applications. It's a lucrative field, because scientists by and large don't have the time or the inclination to troubleshoot software - we just need the thing to work.
Ultimately, this works out in the favor of the programmer, who has a guaranteed source of income from scientists. And it works out for the scientists because it allows us to focus all of our energy on science and not building the tools that we use.
There's a functional limit to the DIY mentality. Yes, you could do it all yourself - but it'd take so much of your time that you wouldn't get done the thing you want to get done in enough time to matter. That's why we work in teams.
Take frontrowcrew.com for example. Back in the day Rym would have to put ID3 tags on the file, upload it to libsyn, copy and paste the URLs back and forth. It was a huge pain, but it got the job done quickly. That's the scientist way. Nowadays you upload the MP3 and it is automatically tagged, uploaded, twittered, bitly'd, and forum'd. Soon it will also be facebook'd. All of that without any human interaction beyond pressing the go button. Sure, it took some time to code that up, but it has already paid off in terms of time savings and aggravation.
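For what it's worth, that whole pipeline doesn't have to be much code. Here's a rough sketch of the tagging-and-uploading part; the upload endpoint is a made-up placeholder rather than the real libsyn API, and mutagen is just one way of many to handle the ID3 tags:

```python
# Sketch of an automated episode-publishing step (hypothetical endpoint).
import requests
from mutagen.mp3 import EasyMP3

def publish_episode(mp3_path, title, artist):
    # Tag the MP3 so it shows up correctly in podcast players.
    audio = EasyMP3(mp3_path)
    if audio.tags is None:
        audio.add_tags()
    audio["title"] = title
    audio["artist"] = artist
    audio.save()

    # Upload and get back the public URL (example.com stands in for the real host).
    with open(mp3_path, "rb") as f:
        resp = requests.post("https://example.com/upload", files={"file": f})
    episode_url = resp.json()["url"]

    # From here the same URL can be pushed to Twitter, bitly, the forum, and so on.
    return episode_url

# Usage: publish_episode("episode.mp3", "Episode Title", "Show Name")
```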
Also keep in mind that some of the biggest things in the world were accidental side projects. Twitter was an accidental byproduct of fucking ODEO. Remember that? Django was an accidental byproduct of a Kansas newspaper. This is what happens when you let programmers do what they do. You start out with a treasure map pointing towards X. But along the way you end up finding all sorts of other treasures if you would just stop and dig for a little while. Often those treasures are bigger than the one you were originally after. It's not Aladdin's Cave of Wonders. You are allowed to take all the treasure.
Imagine two people have to cross over 100 pits, Pitfall style. The first guy makes a grappling hook in about a minute and starts crossing pits immediately. The other guy sits down and starts building a hook-shot, Zelda style. By the time the hook-shot is done, the first guy is already across half of the pits. His arm is getting tired and his speed at crossing pits is slowing down. Still, he has a huge lead! About a minute later the hook-shot guy passes him and then finishes crossing all 100 pits. He then goes back to the beginning and does the whole thing again before the other guy even finishes once. That's the power of letting computers and robots do all your work. It's actually even more powerful with computers, since you can buy more computers and have the work being done multiple times simultaneously, depending on the nature of it.
Heavily customized jobs also happen frequently. I get the impression that the author of the article is talking about the latter situation because he is unfamiliar with the former. This is not what the article is saying. From your perspective, the tool is incomplete because you could build additional functionality into it. From the scientist's perspective, the tool is complete because it has the functionality we care about. Caring has very little to do with it. Money is the bigger consideration, and everything is a cost-benefit scenario.
I could buy an automated 96-well DNA extractor for $100,000, and the kits to use it run about $1500 for 300 reactions. That adds $5 to the cost of processing a single sample, but it might make up for that in saved labor. Of course, each 96-well plate is only usable once, so if I use less than a full plate, I'm basically paying more per sample for the extraction.
Manual DNA extraction via spin columns costs me ~$250 for 100 reactions, half the per-reaction cost of the automated kit. It also maintains its cost per reaction no matter the throughput, because each tube is individualized. And it involves using a centrifuge, which is something we already have.
So when I'm deciding what tool I want to use, I have to do a cost-benefit analysis: will it make sense for me to invest my resources in this shiny thing if I'm only going to use 30% of its capacity? Instead, I'll put forth the minimum resources required to get the job done and direct my savings elsewhere.
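To spell out the arithmetic behind that decision (ignoring labor savings and the $100,000 machine itself), here's a quick back-of-the-envelope comparison using the numbers above:

```python
# Per-sample consumable cost, using the figures quoted above.
AUTOMATED_PER_REACTION = 1500 / 300   # $5.00 per reaction for the automated kit
MANUAL_PER_REACTION = 250 / 100       # $2.50 per reaction for spin columns

def automated_cost_per_sample(samples_on_plate, plate_size=96):
    # Each 96-well plate is single-use, so a partial plate still burns a full
    # plate's worth of kit.
    return (plate_size * AUTOMATED_PER_REACTION) / samples_on_plate

print(automated_cost_per_sample(96))  # 5.0  per sample at full capacity
print(automated_cost_per_sample(30))  # 16.0 per sample at ~30% capacity
print(MANUAL_PER_REACTION)            # 2.5  per sample at any throughput
```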
I understand your perspective - you are a man who builds tools. Your perspective is not the only one, though. As someone who needs to put tools to use, I can tell you that sometimes I don't need a tool that does everything. Sometimes I just need a goddamn screwdriver.
This bonus prize mentality is very prevalent in technology. Take Amazon for example. They started out selling books. To sell books they needed serious hosting. They made serious hosting for themselves. Now Amazon Web Services is a gigantic successful side project from a company that started out selling books. Also, almost every single tech company collects vast amounts of data merely as a side-effect of doing business and having a database. They make lots of money reselling that data to interested third parties, even though their primary business might be movies or some such.
Lots of great scientific discoveries happened by accident when someone was working on something unrelated. I have a feeling that if the mentality described in this article is for real, then a lot of potentially great software that could have come to fruition never did.
But they're fairy tales at best. Breakthroughs and single geniuses don't actually exist in the way we think they do - they just happen to be convenient milestones. What usually happens is that you'll be doing a series of experiments, get some small unexpected result, and dig deeper. You intentionally investigate the anomalies, explain them, modify your original hypothesis, retest, and draw a new conclusion.
tl;dr: "Eureka" moments don't happen. Yes, we are interested in setting goals and taking discrete steps to get to those goals. The problem is that your "bonus prize" concept is fundamentally inconsistent with how scientific investigations work.
I have a question that I'm investigating. I don't know what the answer is. I need a tool to answer the question. How am I supposed to build the tool to account for subsequent investigations when I don't yet know what those investigations will be? This is why science progresses the way it does; we draw conclusions and then figure out what those conclusions mean before proceeding.
Yes, we have a narrow view because that view is fundamental to proper science. You ask a specific question and design an experiment to control very specific variables so you can draw a very specific conclusion. This is why scientists are endlessly frustrated at the non-scientist; you literally do not understand how specific the inquiry process is.
If research scientists treated code as they do actual research (i.e. reading up on what's out there and confirming it or building off of it), I believe efficiency could be increased. However, doing so in a way that doesn't bog down a scientist's own work does require a better understanding of computer science; not just programming, but computer science.
Think of it: the scientist needs code for a short time for a particular use. In order to benefit other scientists, he or she needs to understand software engineering well enough to make a tool that suits that particular use and could be used by others in the future with maybe a slight modification. Modular programming is a good example of a concept that would need to be understood and utilized to make this code sharing realistic. These are things that tend to differentiate a programmer from a computer scientist.
The good news is that computer science (and not just programming) is becoming more and more ingrained in the education of everybody in the sciences and will continue to be so as computer programming becomes more ubiquitous.
See how many different systems you can run Linux on, including toasters? You think Linus was thinking about playing Doom on smart phones when he wrote Linux in the first place? Hell no. He just wrote it in such a way that it was incredibly flexible and reusable. That's why the same software can be in your home router, your television, your phone, and the server hosting this forum.
At work all the time people will ask for the software to do something new. I will often, but obviously not always, say that it already does that. Nobody ever asked for that feature before. Nobody ever used that feature before. The functionality just already exists as a byproduct of the software being well designed. If you write your software properly now, it will also be the software you need later, even though you don't know what you will need later.
I think that blog poster probably doesn't work closely enough with scientists in the field to actually understand what they're doing. This is a very very common issue between scientists and software developers - which is why some really big companies have their own programmers.
If a programmer looked at my scripts they'd be horrified! If anyone else tried to use them, they wouldn't have a clue what to do! Am I interested in making a GUI? Nope! Am I wanting it to work with other programs? Nope! In each case I just copy and paste the results, or upload the resulting batch of html files, and I'm happy. I don't want results as fast as possible, but I do like to be in full control of what results I get, and the simplest way for me to get those results is to be in control of the entire process. That means the scripts have to be simple enough for me to write and understand everything that's going on, which means the last thing I want is a programmer getting their hands on the scripts and making them "better".
While I studied CS at uni, we were frequently told, "If you want to learn how to program, get an associate's degree or go to trade school." Many professional programmers did just that.
1) Transparency and peer review. It is highly questionable to claim that a closed-source chunk of bits produced the results you just published when no one can look at the source to verify the analysis. Opening up the source makes it pretty much impossible to monetize a program, since, as a scientist, you don't want to / can't monetize any service associated with the software.
2) You get funding to do research. If you do computational tools instead then a) you won't get any more funding, b) income from the tools is either illegal or doesn't belong to you.
3) The theory and the results are cool, but the bits in between are booooooooring. Most scientific computations are trivial and optimal implementations are known. Apart from problems associated with scale, and complexity that grows with scale, there is not that much pure CS innovation going on.
4) The theory and the results are cool only in your own opinion. Most scientific computations do not apply to anything even remotely marketable.
Scientific computing serves science first and foremost; it cannot be easily made to serve other purposes. Until science funding changes so that scientists actually benefit from creating extensible, maintainable, and transparent code, scientific programs will not be any of those things.
3 and 4 I got nothing.
The only other thing I can add is that at the very least you could contribute your code to an existing open source project like scipy or something. Then you'll at least be forced to get it into good enough shape that they will accept your patch, and somebody elsewhere might reuse it. Better yet, you can save all the time in the world if you find an open source project that already has what you need.
To give an example, I was working in a biophysics lab and they needed a tool to do image analysis. They needed to measure the size of some vesicles, pipet sizes, the amount of the vesicle in the pipet, and often there's a bead that needs to be tracked. They had code that did some of this in Matlab, and really the code I had to write didn't need to be much more complicated. So I write this code and it's generating data. After this point, what is going to be worth the effort of publishing this in any meaningful way? Most of the background code is common-knowledge algorithms, and most of the front end is very application-specific. I could sit down and write a GUI to make the code idiot-usable, but all of the people working with it know what's going on. They can read code, and I can just leave notes in the code so they can optimize it for a specific run. Any time spent on further development of this tool is time better spent making more tools.
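For a sense of what that code amounts to, here is a rough sketch in the same spirit (in Python rather than Matlab); the thresholding approach and pixel calibration are hypothetical simplifications, not the lab's actual method:

```python
# Crude single-purpose measurement: estimate a vesicle's radius from one frame.
import numpy as np
from scipy import ndimage

def vesicle_radius(frame, threshold=0.5, microns_per_pixel=0.1):
    """Segment the brightest blob by simple thresholding and return its radius."""
    mask = frame > threshold * frame.max()   # crude segmentation
    labels, n = ndimage.label(mask)
    sizes = ndimage.sum(mask, labels, range(1, n + 1))
    area_px = sizes.max()                    # assume the biggest blob is the vesicle
    radius_px = np.sqrt(area_px / np.pi)     # treat the blob as a circle
    return radius_px * microns_per_pixel

# No GUI, no input validation; whoever runs it is expected to read the code and
# tweak the threshold for their own images.
```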