Why does a data scientist need to know how to program?

"she's not your grandfather's statistician"

Paul Houle
Creator of database animals and bayesian brains
December 31, 2013

The "data scientist" title has arisen because we often find that people who are "data analysts" and "statisticians" don't have all the skills to maximize the value of their talents.

For instance, a data scientist typically works in an organization that has software development and a production system. The insights the data scientist produces are based on data from the production system and other places. These insights must be available to managers, software people, marketers and possibly end users, and they need to be turned around quickly so the organization can stay ahead of the competition.

Rather than just "doing an analysis", a data scientist will package up their expertise and knowledge into a system. This way, as new data comes in, models can be updated and reports generated on a real-time, hourly, daily, or weekly basis depending on the needs.

Python is a good language for doing that kind of automation.

4GL tools like SQL often run into limits in what they can do. For instance, I always find myself looking at cumulative probability distributions. You can generate these in Oracle SQL, but not in all versions of SQL. It is easy, however, to write a Python script which processes SQL output and computes the distribution.
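As a rough sketch of what such a script might look like, the example below pulls one numeric column out of a database and computes the empirical cumulative distribution in plain Python. The `orders` table, the `amount` column and the function names are made up for illustration, and an in-memory SQLite database stands in for whatever database you actually query.

```python
import sqlite3
from collections import Counter


def cumulative_distribution(values):
    """Return sorted (value, P(X <= value)) pairs for an iterable of numbers."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return []
    running = 0
    cdf = []
    for value in sorted(counts):
        running += counts[value]
        cdf.append((value, running / total))
    return cdf


def demo():
    # Hypothetical table and data; SQLite is only a stand-in here.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (amount INTEGER)")
    conn.executemany(
        "INSERT INTO orders VALUES (?)", [(10,), (20,), (20,), (40,)]
    )
    rows = conn.execute("SELECT amount FROM orders").fetchall()
    conn.close()
    # Each row is a 1-tuple, so unpack it before computing the distribution.
    return cumulative_distribution(amount for (amount,) in rows)


if __name__ == "__main__":
    for value, probability in demo():
        print(value, probability)
```

The SQL stays a trivial `SELECT`, so the same script should work against any database that has a Python driver, whether or not its SQL dialect supports window functions.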

Data scientists need to know about software engineering because they need to quickly build things that are built to last and that other people will work on. For instance, models and algorithms they develop may go into a production system, and the more quickly and easily that happens, the better. Often a data scientist moves on to another role and leaves behind a bunch of R scripts, Excel spreadsheets and other documents that need to be rewritten by the next person, and you don't want that.

For an example of this teamwork, I worked with a geospatial analyst to incorporate a model he developed into a customer's CRM system. In this case he did the analysis and I did the work of automation and integrating with the API. We both had enough insight into the other person's needs that this collaboration worked well: we fixed the glitches that came up and made the adjustments necessary to fit the customer's needs.

Thus data scientists need to know about using version control and about separating configuration from code, so that they can point the system at a development database or the production database without hacking the source code or compromising security. They need to know how to use an issue tracking system. They should understand and be able to work in a Scrum or agile environment. They should also understand why many processes that work for software development run into problems with work that has a science component, and be able to engage the team to change the process when it gets in the way of moving the football down the field.
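One common way to separate configuration from code is to read the database location from environment variables. The sketch below assumes that approach; the variable names `ANALYTICS_DB_URL` and `ANALYTICS_DB_READ_ONLY` and the function name are invented for illustration, not taken from any particular system.

```python
import os


def load_database_settings(environ=None):
    """Read database settings from the environment instead of the source code.

    ANALYTICS_DB_URL and ANALYTICS_DB_READ_ONLY are hypothetical names.
    """
    environ = os.environ if environ is None else environ
    try:
        url = environ["ANALYTICS_DB_URL"]
    except KeyError:
        # Fail loudly rather than silently falling back to production.
        raise RuntimeError("ANALYTICS_DB_URL is not set") from None
    # Default to read-only so an unconfigured run cannot modify data.
    read_only = environ.get("ANALYTICS_DB_READ_ONLY", "1") == "1"
    return {"url": url, "read_only": read_only}


if __name__ == "__main__":
    # Pass a plain dict to try it out without touching the real environment.
    print(load_database_settings({"ANALYTICS_DB_URL": "sqlite:///dev.db"}))
```

With something like this, the same script can run against development or production depending on how it is launched, and credentials never need to be committed to version control.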

I would say C is a stand-in for knowing about systems programming, and that matters when you need to do optimization work to commercialize an algorithm. For instance, two other team members and I prototyped a really amazing algorithm in two or three days, but it would have taken 100 years to run it against all our data on the (beefy) hardware we had, and we only had two weeks to get it into the product.

On the flight back home I did the math and developed an approximation, and in the two days after that I developed an efficient implementation that could do the calculation in eight hours on a laptop. There are lots of approaches to performance, such as knowing about mathematics and algorithms, careful coding, threading, SIMD instructions, GPU computing, Map/Reduce, FPGAs, etc. Somebody who works on video game engines or codebreaking probably knows this better than a data scientist, but a data scientist should have some skills in this area.

© 2014 Paul Houle