Wednesday, February 21, 2018

Plotting Crosstabs in Pandas

This post is primarily to remind myself of a solution that worked for me. I was analyzing some request table data from our LMS provider. (The request table is derived from log files.) To get a rough and ready estimate of what students were doing in the LMS over the course of a semester, I wanted to create a contingency table from two of the columns: web_application_controller and web_application_action, which specify how the web request was serviced (an assignment was created, a discussion was viewed, etc.)

(I'm still working on getting code snippets to look good in these posts. Bear with me.)

First, we'll create some fake data by randomly creating 100 request records for each of three students, each consisting of an application and an action:

 import numpy as np  
 import pandas as pd  

 application = ['files', 'discussion', 'assignment', 'quiz']  
 action = ['show', 'index', 'create']  
 names = np.array(['alice', 'bob', 'charlie'])  

 student = np.repeat(names, [100, 100, 100], axis=0)  
 applications = np.random.choice(application, 300, replace=True)  
 actions = np.random.choice(action, 300, replace=True)  

 df = pd.DataFrame({'id':student, 'web_application':applications, 'web_action':actions})  
 df.head()

This gives us the following:

Next, we create a create a table of the count of requests per student, using Pandas crosstab:


 df_cross = pd.crosstab([df.id, df.web_action], df.web_application)
 df_cross

This results in:

This was what I wanted, but I couldn't figure out a good way to plot the results. After lots of searching, I came across an answer (I'll try to track down the original) which involved "flattening" or "tidying" the table, using Pandas melt function (which is just like R's melt). Note that we have to reset the index as part of the process.


 df_cross_melt = pd.melt(df_cross.reset_index(), id_vars=['id', 'web_action'], var_name='web_application', value_name = 'count')
 df_cross_melt.head()

From there, it is possible to use Seaborn's factorplot to visualize the results:

 import matplotlib.pyplot as plt
 import seaborn as sns
 %matplotlib inline

 g = sns.factorplot(x='web_application',
                  y='count',
                  hue='web_action',
                  col='id',
                  kind='bar',
                  data=df_cross_melt,
                  sharey=True)

 g.set_xticklabels(rotation=90)
This produces the following chart:


I'm sure that there are much more efficient ways to do this, but this worked for me and got me the result that I needed.

No comments: