This is the third (and hopefully last) post chronicling my attempts to extract some data from a PDF. The earlier posts focused on the challenges; this one focuses a little more on one of the successes.
To set the stage, I was working with a document that had a chronological list of colleges and universities that had gone to "test optional" for their admissions process. The document was mostly just the list of school names, but occasionally had a date range to the left:
Summer 2011 Bryant University
Earlham College
Nichols College
Anna Maria College
Though this format is quite readable for humans, it is obviously not the most useful format for processing. The me of a few years ago would probably have just saved to text and then gone through the text file manually to pull out the dates. But the current me knew better. I knew that I should do something to make things easier on myself for downstream processing.
I had a simple idea. I went through the document and typed "DATE:" in front of each date, then hit return after the date so that the school name was not on the same line. After saving to text, that gave me something that looked like this:
DATE:Summer 2011
Bryant University
Earlham College
Nichols College
Anna Maria College
From there, I could use a simple Python script to loop through each line, ignore blank lines, identify and extract the dates, and create a dataframe that had dates and schools for every row:
import pandas as pd

# data: the lines of the saved text file, one string per line
cur_date = ''
results = []
for element in data:
    element = element.strip()
    # skip blank lines
    if not element:
        continue
    # a "DATE:" line sets the date range for the schools that follow
    if element.startswith("DATE:"):
        cur_date = element.split("DATE:", 1)[1]
        continue
    # any other line is a school name
    results.append((cur_date, element))

# make dataframe
df = pd.DataFrame(results, columns=['Date_Range', 'Schools'])
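The loop above only assumes that `data` is a sequence of lines. As a rough sketch of one way to get there, the snippet below writes a few sample lines to a temporary file and reads them back. The file name `test_optional.txt` is made up for illustration; the real file would be wherever the PDF text was saved.

```python
from pathlib import Path
import tempfile

# A few sample lines in the marked-up format, including a blank line.
sample_text = "DATE:Summer 2011\nBryant University\n\nEarlham College\n"

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name, standing in for the saved text file.
    path = Path(tmp) / "test_optional.txt"
    path.write_text(sample_text, encoding="utf-8")

    # One list entry per line; blank lines are kept here because
    # the parsing loop already skips them.
    data = path.read_text(encoding="utf-8").splitlines()

print(data)
```

Reading the whole file at once is fine for a list this small; for a much larger file, iterating over the open file object would avoid holding every line in memory.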
Once I had the data in a dataframe, grouping by date and plotting the count of schools became quite easy.
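As a minimal sketch of that grouping step (not necessarily the exact code I ran), counting schools per date range might look like the following. The "Fall 2011" rows are made-up placeholders so that there is more than one group; they are not from the real list.

```python
import pandas as pd

# Rows in the same shape as the dataframe built above.
# The "Fall 2011" entries are invented placeholders for illustration.
results = [
    ("Summer 2011", "Bryant University"),
    ("Summer 2011", "Earlham College"),
    ("Summer 2011", "Nichols College"),
    ("Summer 2011", "Anna Maria College"),
    ("Fall 2011", "Example College A"),
    ("Fall 2011", "Example College B"),
]
df = pd.DataFrame(results, columns=["Date_Range", "Schools"])

# sort=False keeps the date ranges in document order rather than
# sorting them alphabetically.
counts = df.groupby("Date_Range", sort=False)["Schools"].count()
print(counts)

# With matplotlib installed, counts.plot(kind="bar") would draw the chart.
```

Passing `sort=False` matters here because the date ranges are labels like "Summer 2011", which would otherwise be sorted alphabetically instead of chronologically.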
The moral of this story is that even if you are doing something partially manually, it pays to add some proactive tweaking to make things easier on yourself down the line. Though this process may not have saved time in making this particular chart (I could have just counted the number of schools in each time period and tracked the results in Excel), I don't consider it a wasted effort. I now have the list of schools along with the date range for each one in a machine-friendly format for future use, if needed.
