We've been investigating Canvas Data for a little over a year. (For those who don't know, Canvas is a learning management system created by a company called Instructure.) Because our infrastructure has been slow to evolve, our current process is done very much "by hand": we download the compressed flat files, unzip them one at a time, and read them into R and/or Python for analysis.
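For anyone curious what the "by hand" step looks like in Python, here is a minimal sketch. It assumes the flat files are gzip-compressed, tab-delimited text with no header row, so the column names have to be supplied from the schema. The function name, the file name, and the column names are made up for illustration.

```python
import csv
import gzip


def read_flat_file(path, column_names):
    """Read one gzip-compressed, tab-delimited file into a list of dicts.

    Assumes there is no header row, so column_names must be supplied
    by the caller (for example, copied from the schema documentation).
    """
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: treat quote characters as ordinary data, not delimiters.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [dict(zip(column_names, row)) for row in reader]


if __name__ == "__main__":
    # Hypothetical file and columns, for illustration only.
    rows = read_flat_file("course_dim-00000.gz", ["id", "canvas_id", "name"])
    print(len(rows), "rows read")
```

This reads the compressed file directly, so the separate unzip step is not needed, but it still has to be repeated for every file.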
Recently, I came across a project called the Canvas Data Loader. This is an unofficial command line tool that downloads the Canvas data and then inserts the records into a relational database (PostgreSQL and MySQL are supported at the moment). As we are in the process of moving to a more robust infrastructure, this tool looked like it could fit the bill perfectly.
The tool itself is written in Rust (a language that I had never heard of, though apparently it has been around for a while). I'm not a systems admin, so it took a while to get everything installed (PostgreSQL, Microsoft Visual C++ Build Tools, and clang/LLVM) in order to try the process out on a team laptop. After a day of downloading and troubleshooting, I compiled the tool and ran it.
The process has now been running for almost a week with no errors. All of the data files have been downloaded, which is great. However, so far only seven of the more than 80 database tables have been created.
Our Canvas instance is relatively old and we are a pretty big school, so many of these tables are well over a million rows. Depending on the number of columns, it looks like we are getting between 5 and 30 rows per second. At five rows per second, one million rows would take 200,000 seconds, or roughly 2.3 days for a single table, which is not a trivial amount of time.
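To make that arithmetic easy to repeat for other table sizes, here is a small back-of-the-envelope sketch. The 5 and 30 rows-per-second rates are the ones observed above; the function name is just for illustration, and the calculation assumes the insert rate stays constant for the whole table.

```python
SECONDS_PER_DAY = 86_400


def load_time_seconds(row_count, rows_per_second):
    """Estimate how long a table takes to load at a constant insert rate."""
    if rows_per_second <= 0:
        raise ValueError("rows_per_second must be positive")
    return row_count / rows_per_second


if __name__ == "__main__":
    # One million rows at the slowest and fastest rates we observed.
    for rate in (5, 30):
        seconds = load_time_seconds(1_000_000, rate)
        days = seconds / SECONDS_PER_DAY
        print(f"{rate} rows/s: {seconds:,.0f} seconds (about {days:.1f} days)")
```

Even at the faster rate, a one-million-row table works out to a little over nine hours, and we have many tables of that size or larger.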
We know that, given the size of our data, no process will be instantaneous, but unless we can figure out how to improve performance, I don't think Canvas Data Loader will meet our needs.
Tuesday, April 10, 2018