Monday, June 21, 2021

The Bus Stop Problem

I ran into a fairly familiar conundrum (for me) the other day: how long should you wait for a long-running process to finish before assuming that something is wrong? The details of the particular case are not important, but it had to do with creating a Redshift table from a materialized view using INSERT INTO. I knew that the query would take some time, so I kicked it off and then went on to other things, checking back every once in a while.

After an hour or so, I began to get a little nervous. Not having any admin access to Redshift itself, I had no real way of assessing progress. I was hoping that the job would finish within an hour, but I also knew that we had other operations that took several hours to complete, so I figured that this might take more time than hoped.

So I waited.

An hour later, I checked in again. Still not finished. Hmm. Maybe three hours is the magic number? I decided to be patient. 

As I kept checking in on my always-in-process query throughout the day, it struck me that I was feeling very much like someone waiting at a bus stop for the last bus of the evening. If you get to the bus stop right on time, but the bus isn't there, you start to wonder if it already came (in which case you will be waiting all night) or if it is just running a few minutes late. The longer you wait, the more nervous you become that you may be too late. But you also don't want to start walking away from the bus stop only to have the bus zoom by without seeing you. Do you wait, or do you walk?*

I have a friend and former colleague, we'll call him "Bill", who rarely seems to experience this problem. As a long-time IT guru, he has a much better understanding of how various systems interact with each other, what the hardware specs of the machines he is working on are, what the theoretical limits are for certain operations, and how to combine all of that information into a reasonable estimate of expected performance. I've often marveled at how quickly after kicking off a process he will say something like "Hmmm, we should be getting much better throughput than that. I wonder if X, Y, or Z is happening?" I wish I had that ability.
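
For what it's worth, the kind of back-of-envelope estimate Bill does in his head can be roughed out in a few lines of Python. This is only a sketch under my own assumptions: the row count, the throughput figure, and the 3x tolerance are made-up numbers, and the function names are hypothetical, not anything provided by Redshift.

```python
def expected_seconds(total_rows: int, rows_per_second: float) -> float:
    """Back-of-envelope runtime: work to do divided by expected throughput."""
    if rows_per_second <= 0:
        raise ValueError("rows_per_second must be positive")
    return total_rows / rows_per_second


def should_investigate(elapsed_seconds: float, estimate_seconds: float,
                       tolerance: float = 3.0) -> bool:
    """True once the job has run `tolerance` times longer than estimated."""
    return elapsed_seconds > tolerance * estimate_seconds


# Hypothetical numbers, for illustration only.
estimate = expected_seconds(total_rows=500_000_000, rows_per_second=100_000)
print(f"estimated runtime: {estimate / 60:.0f} minutes")

# Two hours in: still within 3x of the estimate, so keep waiting.
print(should_investigate(elapsed_seconds=2 * 3600, estimate_seconds=estimate))

# Six hours in: well past 3x of the estimate, so time to start walking.
print(should_investigate(elapsed_seconds=6 * 3600, estimate_seconds=estimate))
```

The point is less the arithmetic than having a number written down before the job starts, so that "do I wait or do I walk?" has an answer decided in advance.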

Long story short, I ended up canceling the job after many more hours than I'd care to admit. I don't know for sure, but I think that there was a connection error early on and I was really waiting for the DB management tool to re-establish the connection rather than for the job to finish. Bill would know...

*My analogy probably works better for someone like me who didn't grow up in a world with Uber and Lyft just a click away.
