Friday, June 4, 2021

Another Scoop of Ice-cream

This is a follow-on from the last post, where I talked about how projects that involve "data" may also require skills that are not necessarily part of the data analyst/scientist tool box (even though others, and even we ourselves, may assume that we have them). This seems to be particularly noticeable in areas that are adjacent to, or even overlapping with, some of our daily activities, specifically around infrastructure and coding.

I've been writing scripts in R and Python for a number of years, but I never trained as a software developer and do not have a background in computer science. For most of what I need to do, this has not seemed to be much of an issue. But, of course, I don't know what I don't know, and that can be dangerous. Like so many other areas of endeavor, you can get pretty far with a surface understanding. (It is a Pareto principle-like idea: 80% of the problems can be solved with 20% of the knowledge/techniques.)

The unfortunate side effect of this, however, is that there are times when the techniques that work fine for developing a small, local script are completely inappropriate for other situations. A fairly trivial example is debugging an issue through the use of "print" statements or just inspecting program output through an IDE like RStudio. This is fine if you are sitting at the keyboard and able to watch the program as it runs, but completely useless if your program is running unattended on a server and triggered by a scheduler. Unless you have become familiar with logging and error handling, you are flying blind.
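To make the contrast with "print" debugging concrete, here is a minimal sketch in Python using the standard library's logging module. The job name, the log file name, and the process_record function are all made up for illustration; the point is only that failures get written somewhere you can read them the next morning, and that one bad record does not silently kill the whole run.

```python
import logging


def configure_logging(log_path="job.log"):
    """Write timestamped log records to a file so an unattended run leaves a trail.

    The default file name is a placeholder; force=True (Python 3.8+) replaces
    any handlers that were configured earlier.
    """
    logging.basicConfig(
        filename=log_path,
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
        force=True,
    )


def process_record(record):
    """Hypothetical unit of work: convert the record to a number and double it."""
    return float(record) * 2


def run_job(records):
    """Process every record, logging failures instead of stopping at the first one."""
    log = logging.getLogger("nightly_job")  # hypothetical job name
    results = []
    failures = 0
    for i, record in enumerate(records):
        try:
            results.append(process_record(record))
        except (TypeError, ValueError):
            failures += 1
            # log.exception records the message plus the full traceback
            log.exception("record %d could not be processed: %r", i, record)
    log.info("finished: %d ok, %d failed", len(results), failures)
    return results, failures


if __name__ == "__main__":
    configure_logging()
    run_job(["1", "not a number", 3])
```

A scheduler could call this script and nobody would need to be watching: the log file would hold a timestamped traceback for the bad record and a summary line at the end. A real job would likely also want log rotation and some kind of alerting, which this sketch leaves out.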

In the same vein, rather "inefficient" (from a computer science point of view) programming techniques may go virtually unnoticed when working with a relatively small data set locally. That your for-loop takes 0.2 seconds to iterate through the 100 records in the single file that you are working with locally may not even be noticeable. However, if you use that same technique to iterate over the records in 19,000 files, you are now contending with over an hour of processing time (0.2 seconds times 19,000 files is 3,800 seconds, or about 63 minutes). This can become a real issue if you are paying a per-minute computing charge in a cloud environment.
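The back-of-the-envelope arithmetic is worth writing down before scaling anything up. A small sketch follows; the per-minute price is a purely made-up placeholder, not a real cloud rate, and the function names are mine.

```python
def estimated_runtime_seconds(seconds_per_file, n_files):
    """Total runtime if every file takes about as long as the one timed locally."""
    return seconds_per_file * n_files


def estimated_cost(total_seconds, price_per_minute):
    """Compute charge for a given runtime at a flat per-minute price."""
    return (total_seconds / 60) * price_per_minute


# Figures from the example above: 0.2 seconds per file, 19,000 files.
total_seconds = estimated_runtime_seconds(0.2, 19_000)
total_minutes = total_seconds / 60  # about 63 minutes

# Hypothetical price, for illustration only.
ASSUMED_PRICE_PER_MINUTE = 0.05
cost = estimated_cost(total_seconds, ASSUMED_PRICE_PER_MINUTE)

print(f"{total_seconds:.0f} seconds, roughly {total_minutes:.0f} minutes")
print(f"cost at the assumed rate: {cost:.2f}")
```

This assumes the files are all about the same size and that nothing else (network, disk, memory) becomes the bottleneck at scale, which is exactly the kind of assumption that tends not to hold.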

The other unfortunate side effect is that our estimates of effort and how close we are to "ready" can become very biased. Once we have our local process working, we get the false sense that we have conquered the hard part (and we probably have, in terms of our portion of the work). But because we are only familiar with our own working context, we might greatly overestimate how close to "production ready" our locally-produced code or data process actually is. It is like looking at the tip of the iceberg and ignoring everything under the water. Moving from a desktop to the enterprise may not be a matter of "just" automating or "just" scaling up what we have working locally -- it may be much more complicated and involve all sorts of issues, contingencies, and edge cases that we never even contemplated.

