Friday, February 9, 2018

"You're out of order!"

If you've ever seen the movie ...And Justice for All, you will do doubt vividly recall Al Pacino delivering that famous line during the film's climatic courtroom scene. This post is about a different kind of order -- the order of sorted data.

I was almost tripped up by order recently while munging some survey data for analysis. The survey had been taken by representatives of various academic and service units on the university campus and this was the second year it had been administered. I was attempting to match the current year's respondents to previous respondents to investigate change from one year to the next. This was not the straight-forward process that I assumed it would be, as some of the drop-down options for the "demographic" (in terms of where the unit is in the university hierarchy) questions had been altered between the two years. In addition, folks sometimes filled out a single survey to cover multiple units, so one had to look at a comment box to see if they had made any annotation to that effect (e.g., "this survey also covers unit X, Y, and Z"). Because there was no guarantee that the same units had responded to the survey each year, this also meant that one had to recognize that units with similar names across the two surveys may not refer to the same entity (e.g., Chemical and Molecular Bioengineering, Biochemistry, and Bioengineering are not necessarily the same thing). 

I spent a bit of time trying to come up with a principled way to do this in R, but decided that it might be more efficient to just do things "by hand" as there were only about 100 survey responses. The data were in .csv format, so I grabbed the relevant columns from each survey, pasted them into separate workbooks in Excel, and compared them side by side. Each time I found what I determined was a "match," I would grab the unique id for that respondent in the current survey and paste it into a new column in the other sheet. Once I had matched all that I could, I pasted that new column into the survey response data from the previous year. That column was the link between the current year and the previous year.

At a couple of points along the way, I sorted the Excel sheets (sometimes on multiple columns) to make this visual comparison easier. For better or for worse, Excel does what you tell it to (usually). If you sort on multiple columns, and then later on a single column, Excel will maintain the current order for everything other than the column you are sorting on. Though I re-sorted on what I thought was the "main" column before pasting the final results into the original spreadsheet, I didn't double-check the final order. Had I done so, I would have realized that things were still not quite back in the "original" order. It was only at a later stage in the analysis that I realized my mistake. Ouch.

The moral of the story is to always keep track of any data permutations you are performing, particularly when using tools that do not naturally leave a digital trail. (This is why I prefer to work in R or Python and make modifications to data objects rather than the original data source -- you can always modify the script and rerun the process staring from the "original"data.) Had I simply added an "original position" column to the Excel sheet before doing any sorting, I could have easily sorted on that to get back to the starting state.

No comments: