According to Wikipedia, regex (short for "regular expression") is "a sequence of characters that specifies a search pattern." There's an old regex joke in coding circles that goes something like this:
You have a problem. You are trying to extract specific information from a string. So you say to yourself, "I know, I'll use regular expressions." Now you have two problems.
This is one of those jokes which makes you laugh and wince at the same time. It's funny because it's true. Here's my story.
The other day, I came across a chronological list of colleges and universities that have moved to a "test optional" application process. ("Test optional" means that students are not required to submit e.g., SAT or ACT scores as part of their application, a policy which is not new, but became supercharged as Covid-19 wreaked havoc on standardized testing sites and schedules.) Because the list was in PDF format, I thought it might be useful to parse the data to something more amenable to manipulation, such as csv, so that the increase in schools over time could be more easily counted or even plotted.
The first challenge was extracting the data from the PDF (which is about 17 pages long). Based on some Googling and Stack Overflow posts, I first looked into a Python package called PyPDF2, but that didn't seem to work for me. I knew that I didn't want to spend too much time on this (famous last words), so I tried just exporting to Excel from Adobe Acrobat, but that wasn't much better. I also tried copy/pasting from the "edit" view in Acrobat, but that was also quite frustrating and I quickly gave up. (Seventeen pages is not trivial to cut and paste, and my wrist started complaining pretty quickly.)
I went back to more pressing tasks for a bit, but had that gnawing desire in the back of my mind to not admit defeat yet. The next time I had a break, I decided to search for an alternative Python package and came across another post that had some sample code to extract text from a PDF using the "SimplePDFViewer" module from the pdfreader package. I tweaked the sample code, pointed it at the PDF, and got promising results. Not perfect by any stretch of the imagination (breaks in strange places, empty lines, only one page parsed, etc.), but promising.
The format of the document is basically a list of colleges (one per line) with occasional date ranges to the left of the names. Most entries also contain some kind of parenthetical material about the details of the test optional process (e.g., "Alfred University (for high school classes of 2021, 2022 and 2023)").
It took several tries to rewrite the sample code to actually read all seventeen pages of the document and extract the text from each page. I found that using spaces and ignoring the parentheses via regex
re.split(r'\([^()]*\)|\s\s+', page)
tended to do a relatively decent job of splitting the schools on each page, allowing me to go from something that looked like a single string (to the computer) after the initial extract:
' Union College Franklin & Marshall College (extend optional policy to all) King’s College Nazareth College Mitchell College Lake Forest College Salisbury University (3.5 GPA) Summer 2006 Hobart & William Smith Colleges Providence College Spring 2006 Bennington College Gustavus Adolphus College George Mason University (3.5 GPA) Lebanon Valley College Fall 2005 Knox College Drew University Chatham College Spring 2005 Lawrence University The College of the Holy Cross St. Lawrence University Winter 2004-2005 Susquehanna University (“Write Option” for all) Sarah Lawrence College '
to something that looked like this (a Python list):
['', 'Union College', 'Franklin & Marshall College', '', 'King’s College', 'Nazareth College', 'Mitchell College', 'Lake Forest College', 'Salisbury University', '', 'Summer 2006', 'Hobart & William Smith Colleges', 'Providence College', 'Spring 2006', 'Bennington College', 'Gustavus Adolphus College', 'George Mason University', '', '', 'Lebanon Valley College', 'Fall 2005', 'Knox College', 'Drew University', 'Chatham College', 'Spring 2005', 'Lawrence University', 'The College of the Holy Cross', 'St. Lawrence University', 'Winter 2004-2005', 'Susquehanna University', '', 'Sarah Lawrence College', '']
But there were still empty strings in the list, as well as elements that didn't parse nearly as well, leading to some multiple entries like: "Illinois State University Illinois Wesleyan University" as a single line in the final processed file.
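Here is a minimal sketch of that split on a shortened sample, plus one way to drop the leftover empty strings. It assumes the raw page text separates entries with two or more spaces (which is what the `\s\s+` half of the pattern relies on); the variable names are mine, not from the original script.

```python
import re

# Shortened sample of one page's raw text. The double spaces between
# entries are an assumption about how the extractor separated them.
page = (
    "  Union College  Franklin & Marshall College (extend optional policy to all)"
    "  Salisbury University (3.5 GPA)  Summer 2006  Providence College  "
)

# Split on parenthetical notes or on runs of two or more whitespace characters.
parts = re.split(r'\([^()]*\)|\s\s+', page)

# The split leaves empty strings and trailing spaces behind; filter them out.
entries = [part.strip() for part in parts if part.strip()]

print(entries)
# ['Union College', 'Franklin & Marshall College', 'Salisbury University',
#  'Summer 2006', 'Providence College']
```

The empty strings show up wherever a parenthetical note is immediately followed by a run of spaces, because the two matches sit back to back with nothing in between.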
If regex helped the first time, then of course the answer is "more regex"! So I decided to rewrite the program to read back in the initial processed output, remove blank lines (which worked), and use "regex magic" to split those long, confusing entries into individual schools (which didn't).
At first, things seemed to be going smoothly. I read the file line by line, ignoring the blank lines, and looked for lines that had more than one instance of the word "University" or "College" to identify lines with multiple entries. If no multiple entries were found, the line was saved back to a file. If a multiple entry was found, I would try to parse the line using regex:
([\w\s]+(University|College))\s([\w\s]+(University|College))
which could then extract each of the school names as a regex group before saving. This worked wonderfully, until it didn't. I quickly realized that some schools have "Institute" in the name, though adding "Institute" to the regex was not terribly challenging:
([\w\s]+(University|College|Institute))\s([\w\s]+(University|College|Institute))
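To make the mechanics concrete, here is a rough sketch of how that pattern can pull two names out of one line. The helper name `split_pair` and the use of `fullmatch` are my assumptions for illustration; the keyword count stands in for the "more than one instance" check described above.

```python
import re

KEYWORDS = re.compile(r'University|College|Institute')
TWO_SCHOOLS = re.compile(
    r'([\w\s]+(University|College|Institute))\s([\w\s]+(University|College|Institute))'
)

def split_pair(line):
    """Return a list of school names, splitting the line if it looks like a pair."""
    # Fewer than two keywords: treat the line as a single school.
    if len(KEYWORDS.findall(line)) < 2:
        return [line]
    match = TWO_SCHOOLS.fullmatch(line)
    # Two keywords but no match: leave the line alone rather than guess.
    if match is None:
        return [line]
    # Groups 1 and 3 hold the full names; groups 2 and 4 hold only the keywords.
    return [match.group(1), match.group(3)]

print(split_pair("Illinois State University Illinois Wesleyan University"))
# ['Illinois State University', 'Illinois Wesleyan University']
print(split_pair("Knox College"))
# ['Knox College']
```

Because the inner keyword alternations are capturing groups too, the school names land in groups 1 and 3 rather than 1 and 2, which is easy to trip over.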
The bigger problem was three-fold: The first was that, while most of the troublesome lines contained two entries, some contained many more (like this humdinger: "Indiana Tech Indiana University East Indiana University Kokomo Indiana University Northwest Indiana University Southeast Indiana University South Bend Indiana University of Pennsylvania Indiana Wesleyan University Iowa State University"). I haven't yet been able to successfully modify my query to match "one or more" additional schools on the same line, though I know it is possible in theory (and probably even trivial for someone who works with regex daily).
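For what it's worth, one possible approach to the "one or more" case is to stop describing the whole line and instead let `re.findall` collect every shortest run that ends in a keyword. This is only a sketch under the assumption that every name ends in "University", "College", or "Institute"; the function name is hypothetical, and the second example shows how quickly that assumption falls apart.

```python
import re

# Lazily match the shortest run of word characters and spaces that ends in a keyword.
SCHOOL = re.compile(r'[\w\s]+?(?:University|College|Institute)')

def split_schools(line):
    """Return every keyword-terminated chunk on the line, however many there are."""
    # Any trailing text that does not end in a keyword is silently dropped.
    return [name.strip() for name in SCHOOL.findall(line)]

# Works when every name ends in one of the keywords.
print(split_schools("Illinois State University Illinois Wesleyan University Knox College"))
# ['Illinois State University', 'Illinois Wesleyan University', 'Knox College']

# Mis-splits names without a keyword or with a trailing location.
print(split_schools("Indiana Tech Indiana University East Iowa State University"))
# ['Indiana Tech Indiana University', 'East Iowa State University']
```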
The second problem was the variation in naming conventions, all of which the regex pattern would need to account for. Here are just a few:
- <something> College (e.g., Swarthmore College)
- <something> University (e.g., Harvard University)
- <something> Institute (e.g., Rensselaer Polytechnic Institute)
- <something> Institute of <something> (e.g., Massachusetts Institute of Technology)
- University of <something> (e.g., University of Hawai`i Hilo)
- College of <something> (e.g., College of Wooster)
- <something> University <something> (e.g., Loyola University Chicago)
(And don't forget the "false friends" like the string "University of Maryland College Park", which is a single place and not the name of two distinct schools, unlike the "Illinois State University Illinois Wesleyan University" string earlier.)
The third problem was the not insignificant number of typos and similar errors in the document. These were usually mismatched brackets (e.g., opening parentheses, but closing brackets; brackets rather than parentheses; doubled opening or closing parentheses; square rather than curly brackets), and all of them would need to be accounted for in the regex. The difference between "Caldwell University {Three-year pilot)" and "Caldwell University (Three-year pilot)" may be trivial and inconsequential for the human reader, but is a world of difference for an automatic parser.
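One way to sidestep some of this, rather than teaching the regex about every bracket typo, would be to normalize the brackets before parsing. The sketch below is an assumption on my part (the function name is hypothetical, and it was not part of the original script): it maps curly and square brackets to parentheses and collapses doubled ones, which would also flatten any legitimately nested parentheses.

```python
import re

# Map curly and square brackets onto plain parentheses.
BRACKETS = str.maketrans({'{': '(', '[': '(', '}': ')', ']': ')'})

def normalize_brackets(line):
    """Rewrite mismatched or doubled brackets as single parentheses."""
    line = line.translate(BRACKETS)
    # Collapse runs like "((" or "))" into a single parenthesis.
    line = re.sub(r'\(+', '(', line)
    line = re.sub(r'\)+', ')', line)
    return line

print(normalize_brackets("Caldwell University {Three-year pilot)"))
# Caldwell University (Three-year pilot)
```

After a pass like this, a pattern such as `\([^()]*\)` has a fighting chance of recognizing the parenthetical notes it was written for.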
So, although I had some moments of success along the way, I realized that regex was not going to save the day this time. Maybe this kind of thing is trivial for a regex guru, but I could feel myself getting dragged deeper and deeper into the regex rabbit hole. (There is some email validation regex that runs over 6,000 characters in length! Dealing with all of the edge cases can get pretty hairy.)
At the end of the day, I gave up, exported the PDF to Word, manually added carriage returns to each date range along with a prepended "DATE:" designation, and copy/pasted everything to a text document which I then saved, after using Notepad++ to remove all of the tabs. Maybe tomorrow I'll feel like trying to do something with the file, but regex beat me up pretty good today.
