The Third Distractor: Large-ish data, or "Where should I put this?"

When I was collecting data for research studies while working on my degree in second language acquisition, the "golden number" was 30. If you could somehow get at least 30 language learners to participate in your research experiment, you were good to go. (This reminds me of the old joke: "How many statisticians does it take to change a light bulb?" "At least 30, but a really good one can get away with as few as 25.") As my focus turned more towards language testing, 30 seemed a little small, and the "rule of thumb" for using Rasch measurement was "at least 100." I had somewhere north of 400 test taker responses for my dissertation work, which I was happy with. When I was working as Assessment Director of a federally funded language research center, we could normally get thousands of test takers to pilot a commonly taught language like Spanish, with considerably fewer for less commonly taught languages like Hindi or Urdu. For all that time, I never had an issue processing data on whatever desktop (or, increasingly, laptop) computer I happened to be using. (I seem to recall a brief period of time circa 2007 when Excel would sometimes choke on our biggest data sets, but that was easily overcome by upgrading to the most recent version.)

Recently, however, I've been working with data from a learning management system (LMS) that serves over 40,000 students and instructors. A "data dump" of information from the system consists of over 80 informational tables which, for reasons I won't go into here, we only have access to as flat files. Although most of the files are only a few hundred megabytes in size, some are several gigabytes, and one is so large that attempting to decompress it (files are downloaded in a compressed format) resulted in an "out of space" error on a 1TB drive. Even for the "smaller" files, manipulating them in R (our language of choice) through joins and the like can quickly lead to memory issues. This is not "big data" by any means -- a modern computer with a multi-TB hard drive should be able to hold it all -- but it is large enough to be quite unwieldy for someone like me who has never needed to be concerned with memory-efficient processing or having enough storage space. There seems to be a gap between "big data" solutions (I certainly don't need a Hadoop cluster for a couple of terabytes' worth of data) and "standard-issue" office computers.
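One way to sidestep both problems above (running out of disk while decompressing, and running out of memory while loading) is to read a compressed file as a stream and keep only a running summary. Here is a minimal sketch of that idea, written in Python rather than R purely for illustration. It assumes the file is a gzip-compressed CSV with a header row, which may not match the actual LMS export, and the function name and column name are hypothetical.

```python
import csv
import gzip
from collections import Counter


def count_rows_by_column(path, column):
    """Stream a gzip-compressed CSV and count rows per value of `column`.

    Assumes (hypothetically) that the file is gzip-compressed and has a
    header row containing `column`.
    """
    counts = Counter()
    # Text mode ("rt") decompresses on the fly, so the file is never
    # expanded on disk and only one row is held in memory at a time.
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            counts[row[column]] += 1
    return counts
```

Calling something like count_rows_by_column("events.csv.gz", "user_id") (both names made up) would return per-user row counts. Memory use grows with the number of distinct values in the column rather than with the size of the file, so this helps with summaries but not, by itself, with large joins.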

The moral of this story is that having a "right-sized" data infrastructure in place is important, and just having years of experience in "data analysis" doesn't necessarily make you fluent in working efficiently with "larger-than-memory" data. There's a lot for me to learn.


Friday, February 2, 2018
