Saturday, June 5, 2021

Don't Forget the Cone

In keeping with the ice-cream metaphor, this is yet another follow-on from a post a few days ago that talked about creating logic and a proof-of-concept script that would check for LMS course completion (to identify students eligible for a free scoop of ice-cream). 

I had created a script to 1) get the most recent roster for the target course (as students could self-enroll), 2) check that roster against a list of previously identified finishers and filter them out, and 3) check the remaining students in the roster to get their completion status. As the script runs, it saves information to three csv files: the current roster for the course (overwritten each time the script is run), a running list of students who had finished the course (appended to each time the script is run), and a "recent completers" list of those students who had finished the course since the last time the script was run (created with a timestamp each time the script is run, but empty if there are no new "completers"). 

Although my script is not being used for the official project (it was only intended to illustrate how the logic might work), I've been running it a couple times a day to see how it performs now that the course is live. 

The first time I ran it, as expected, I noticed a few places where my script was not quite working correctly. Because I only had access to finished courses when writing it, the number of completions never changed no matter how many times it ran. Once I pointed it to the live class, I noticed that it was appending the complete list of "completers" each time rather than just the "delta" between the first and last time it was run. A few code tweaks here and there seemed to get things working as intended.

The next day, I ran the script again. It caught the updated class roster, correctly identified previously completed students, and saved everything where it should be. (Because the class is targeting incoming students, the roster expands rather quickly as they self-register.) I ran the script again a few hours later and it worked perfectly again. 

I was pretty pleased with myself. 

The next day, I ran the script first thing in the morning, and again it worked just as expected. By this point, I was becoming fairly convinced that the only potential glitch remaining would be if the number of students got too large there might be a bottleneck when checking completion status, as that is a separate API call for each student. I added a simple "end time - start time" calculation to keep track of how long things were taking. I ran the script again a couple of times throughout the day. So far, no problems.

Now I was really pleased with myself. 

The next day, I again ran the script first thing in the morning, and again it worked flawlessly. The overall time that the script was taking was slowly increasing as more and more students self-enrolled, but the rate was much less than I expected because I was filtering out those students who were already finished, so the overall number to check was still quite manageable. 

And then it happened.  

I ran the script again a little later in the day. I had a couple of print statements here and there so I could get a sense of where things were in terms of execution, and one that I didn't expect to see popped up: "No existing completions found."  Huh? That should only happen the first time I run the program....

One of the first things that the program does it to check for the "completed students" file so it can use it to filter out those students whose completion status has already been verified. Because, by definition, there will be no completed students the first time the program is run, I wrapped the piece of code that reads the file in a try-except block because it will throw an error if it tries to read a non-existent file. The "No existing completions found" message is from the error-handling part, which prints out that message and then creates an empty "completed students" list. I got that message the very first time I ran the script for the "live" course, but that was to be expected and I hadn't seen it since....

I looked at the directory where the output files were stored. Perhaps I had inadvertently deleted the file? Nope, everything seemed to be there. In fact, because my code was written to handle the exception (*any* exception), produce the message, and continue on, and because the completed student file appends results, I now had duplicates of "completers" in the saved file.

I was beginning to feel a little less smug (and quite grateful that my script wasn't the linchpin in the actual project -- see my other ice-cream post). 

So, what had happened? After a bit of Googling and head-scratching, it appears that the issue was one of file encoding. My script was failing to read the file because the Pandas read_csv default is utf-8. However, one of the records must have included a special characters (probably in a student name) that caused an error reading the file, throwing an exception. Explicitly setting the encoding to ISO-8859-1 in the read_csv statement removed the error. 

There are actually two morals to this story. The first is that the difference between a fragile script and a bullet-proof script may not be apparent until an unanticipated edge-case appears. Not being a software developer, it never occurred to me to think about things like file encoding. Just because a script can run for a few days without issue does not mean that it is production ready (as we have just seen). 

The second moral is that I should have been more specific in my error handling. When writing the script, it threw an exception when it couldn't find the completed students file. But rather than handling that *specific* exception, I wrote the script to handle *any* exception in the same way.  This allowed a completely different error to be treated the same as the error that I had anticipated, with obviously unfortunate consequences.

No comments: