Friday, February 23, 2018

Plotting in Pandas: A Cautionary Tale

OK, so we're actually plotting via matplotlib, but that lacked the alliteration needed for a decent title.

This is a follow-on post to one a couple of days ago about plotting some "log"-ish information from our LMS. The idea is to visualize the types of action (what the learner is doing) / application (which part of the LMS the learner is using) patterns for several different learners. This time, I was trying to illustrate the relative frequency of different patterns across students.

We'll start by creating some fake data again. We have five learners, some of whom have performed a random subset of actions across a subset of applications. (I'm sure there are much more compact ways of generating fake data, but I'm the kind of person that needs to think it through step-by-step.)


import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
%matplotlib inline
 
application = ['files', 'discussion', 'assignment', 'quiz', 'wiki', 'announcement']
action = ['show', 'index', 'create', 'backup']
names = np.array(['alice', 'bob', 'charlie','dorothy', 'ellen']) 

student = np.repeat(names, [100, 100, 100, 100, 100], axis=0)
applications1 = np.random.choice(application[:4], 300, replace=True) # subset
applications2 = np.random.choice(application, 200, replace=True) # all
actions1 = np.random.choice(action[:3], 300, replace=True) # subset
actions2 = np.random.choice(action, 200, replace=True) # all 

applications = np.concatenate((applications1,applications2))
actions = np.concatenate((actions1,actions2)) 

df = pd.DataFrame({'id':student, 'web_application':applications, 'web_action':actions})
df.head()

This gives us a data frame with learners, actions, and applications.


We can do some aggregation in pandas to make a chart of the frequency of each action/application combination for each learner. We'll count and sort the values as part of the process, saving each resulting aggregation to a list. At the end, we'll plot one of the list elements (the results for 'charlie') to make sure things look as expected.

# get the list per student
names = ['alice', 'bob', 'charlie','dorothy', 'ellen']

df_list = []
for name in names:
    mytemp = df[df['id'] == name].groupby(['web_action','web_application'])['web_action'].count().sort_values()
    mytemp.name = name
    df_list.append(mytemp)

df_list[2].plot(kind='barh', title=df_list[2].name)

This gives us:


Not bad. We could repeat the process for each of the elements in the list to get a plot for all of the students if we wanted. But wouldn't it be nice to see all of the students together in one plot? That's what I thought, and that's where the cautionary tale part of the story comes in.

I haven't done a whole lot of plotting in Python, but after working out some of the initial kinks of properly specifying subplots, I managed to create a single plot with the information that I wanted. Unfortunately, the plot seemed a little busy in terms of labeling.

fig,axs = plt.subplots(3,2, figsize=(20,15))

for i in range(0, len(df_list)):
    df_list[i].plot(kind='barh', title=df_list[i].name, ax=axs.flat[i])




Wouldn't it be nice if we could clear up some of the clutter by getting rid of the y axis label and having the plots share the tick labels? That led to my next plot.


fig,axs = plt.subplots(3,2, sharey = True, figsize=(20,15))

for i in range(0,len(df_list)):
    df_list[i].plot(kind='barh', title=df_list[i].name, ax=axs.flat[i]).yaxis.label.set_visible(False)


Much less cluttered, which is good. Unfortunately, sharing the y axis tick labels makes the chart, well, wrong. A superficial glance makes it look as though things are working: some students had fewer possible application/action combinations, so there are the expected gaps for the expected students. Closer inspection reveals that sharing the y labels has led to mislabeled information for the majority of the subplots. Each student's results are sorted by count, so the data are not in the same order across the students, yet the shared axis shows only one set of tick labels. Things are aesthetically pleasing, but not correct. (I know, I know...duh!)
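A quick way to confirm the ordering problem without squinting at the plots is to compare the index order of the per-student results. The sketch below uses two tiny hand-made series with hypothetical counts (not the fake data above), and `same_order` is a made-up helper name, not a pandas function.

```python
import pandas as pd

def same_order(series_list):
    """Return True only if every Series lists its index entries in the same order."""
    first = list(series_list[0].index)
    return all(list(s.index) == first for s in series_list[1:])

# Hypothetical counts for two students, sorted by value as in the loop above
alice = pd.Series({('show', 'quiz'): 1, ('index', 'files'): 5}, name='alice').sort_values()
bob = pd.Series({('show', 'quiz'): 7, ('index', 'files'): 2}, name='bob').sort_values()

# The bars are drawn in index order, so if the orders differ,
# one shared set of tick labels cannot be right for both students.
print(same_order([alice, bob]))
```

Here the two students rank the same two combinations in opposite order, so the check comes back False. Running `same_order(df_list)` on the real list should show the same thing whenever the students' counts sort differently.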

Rather than putz around some more, I decided to revert to the previous version. The moral of the story is the oft-repeated dictum that computers are stupid and will do what you tell them to do (even if you told them to do something stupid), so always double-check.
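For the record, one way the shared-axis version could probably be made correct is to put every student's counts on the same index, in the same order, before plotting. The following is a minimal sketch of that idea, not the approach used in this post. `align_counts` is a hypothetical helper name, and it assumes each series has a unique index of (action, application) pairs like the ones in `df_list`.

```python
import pandas as pd

def align_counts(series_list):
    """Reindex every per-student Series onto one shared, sorted index.

    Combinations a student never used are filled with a count of 0.
    """
    # Collect every (action, application) pair seen for any student
    all_keys = sorted(set().union(*(s.index for s in series_list)))
    common = pd.MultiIndex.from_tuples(all_keys, names=series_list[0].index.names)

    aligned = []
    for s in series_list:
        a = s.reindex(common, fill_value=0)
        a.name = s.name  # keep the student's name for the subplot title
        aligned.append(a)
    return aligned

# Hypothetical miniature example: bob never did ('index', 'files')
alice = pd.Series({('show', 'quiz'): 1, ('index', 'files'): 5}, name='alice').sort_values()
bob = pd.Series({('show', 'quiz'): 7}, name='bob')
aligned = align_counts([alice, bob])
print(aligned[1])
```

With `df_list = align_counts(df_list)` in place, the same `sharey = True` loop should label every subplot consistently. The trade-off is that the bars are no longer sorted by frequency within each student, because every subplot now has to use the same order.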
