This is a follow-up to a recent post about regex. The context: I was trying to pull data from a PDF that presented a chronological list of colleges and universities that have adopted a test-optional admissions process. As detailed in the previous post, I had tried (and eventually given up trying) to extract the data directly from the PDF with a Python program. I finally broke down and cut and pasted the contents into a text file from a Word doc that the PDF had been exported to using Adobe Acrobat.
Now that I had a text file (and a good night's sleep to forget all about regex), I thought that it should be relatively easy to finish up the process. Determined to use Python for at least part of the process, I created a small script to read in the text file that I had created previously, initially printing the results to the screen.
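For illustration, a first pass at that kind of script might look like the sketch below. This is not the original script; the function names are made up for this example, and it assumes the text file is UTF-8 encoded. Printing each line with repr() is a small trick that makes blank lines and stray whitespace easy to spot on screen.

```python
from pathlib import Path


def read_lines(path):
    """Return the file's lines as a list, without trailing newline characters."""
    # Assumption: the file is UTF-8; adjust the encoding if yours differs.
    text = Path(path).read_text(encoding="utf-8")
    return text.splitlines()


def show_lines(lines):
    """Print each line with its number; repr() makes empty lines and whitespace visible."""
    for number, line in enumerate(lines, start=1):
        print(f"{number:4d}: {line!r}")
```

Calling show_lines(read_lines("colleges.txt")), where colleges.txt is a hypothetical file name, prints one numbered line per line in the file, so a line that holds more than one entry stands out right away.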
At first, I noticed that there were a number of blank lines when reading in the file, but I figured I could get rid of those easily enough before saving. But as I looked more carefully at the lines, I noticed the same issue I had run into when reading directly from the PDF -- multiple entries on the same line. Huh? How could that happen?
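Getting rid of the blank lines really is the easy part. Here is a minimal sketch, assuming the lines have already been read into a list; drop_blank_lines is a hypothetical helper name, and the sample entries are placeholders rather than real data from the file. Note that it does nothing about the harder problem of several entries sharing one line.

```python
def drop_blank_lines(lines):
    """Keep only lines that contain something other than whitespace."""
    # strip() also removes the trailing newline and any padding around the text.
    return [line.strip() for line in lines if line.strip()]


# Placeholder input: two real lines separated by an empty line and a whitespace-only line.
sample = ["First entry\n", "\n", "   \n", "Second entry\n"]
cleaned = drop_blank_lines(sample)
print(cleaned)
```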
I don't know for sure, but my best guess is that, whatever process was used to create the original file, not everything that looked like a separate line was actually a separate line. In other words, there was not a line break after everything that looked like a line. (You've probably seen the same behavior at the page level in Word, where text flows onto the next page automatically. You may have even added some additional spaces at the end of a sentence near the bottom of a page to "help" that process along...)
The solution was to go back to the exported Word doc (which I had saved, luckily) and, rather than cut and paste, try to save directly to text. Lo and behold, a dialog box popped up asking if I would like to add line breaks. Yes, please!
With that done, the script correctly read the lines as expected, and I could get on to the business of manipulating the data (which I'll describe in the next post).
The moral of the story? Just because something "looks" like columns or lines or whatever in a document, that doesn't mean that it will behave as such programmatically. Be careful when cutting and pasting!
