Colarusso: Sample Notebook for Extracting Data from OCRed PDFs Using Regex and LLMs

One can use this notebook to build a pipeline to parse and extract data from OCRed PDF files. Warning: When using LLMs for entity extraction, be sure to perform extensive quality control. They are very susceptible to distracting language (latching on to text that sounds “kind of like” what you’re looking for) and missing language (making up content to fill any holes), and importantly, they do NOT provide any hint as to when they may be wrong. You need to make sure random audits are part of your workflow! Below we’ve worked out a workflow using regular expressions and LLMs to parse data from zoning board orders, but the process is generalizable.

  • Collect a set of PDFs
  • Place OCRed PDFs into the data folder
  • Write regexes to pull out data
  • Write LLM prompts to pull out data
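
The regex and audit steps might look something like the following minimal sketch. It assumes the text has already been pulled out of the OCRed PDFs into plain strings. The field names, patterns, and sample order text are all hypothetical and are not taken from the notebook. The random sample at the end is one simple way to build the audits the warning calls for.

```python
import random
import re

# Hypothetical patterns for a zoning board order; real documents will need
# patterns tuned to their own wording and to OCR noise.
PATTERNS = {
    "case_number": re.compile(r"Case\s+No\.?\s*([A-Z]{2,4}-\d{4}-\d+)", re.IGNORECASE),
    "decision_date": re.compile(r"Dated?:?\s+([A-Z][a-z]+ \d{1,2}, \d{4})"),
    "outcome": re.compile(r"\b(GRANTED|DENIED)\b"),
}


def extract_fields(text):
    """Return a dict of field -> first captured match, or None when absent."""
    record = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        record[field] = match.group(1) if match else None
    return record


def sample_for_audit(records, k, seed=None):
    """Pick up to k records at random for manual review."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


# Made-up text standing in for one OCRed order.
sample_text = "Case No. ZBA-2023-17\nThe petition is hereby GRANTED.\nDated: March 4, 2023"
fields = extract_fields(sample_text)
```

Returning None for a missing field, rather than guessing, keeps the gaps visible so they can be checked by hand or passed to an LLM prompt and then audited.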

https://github.com/colarusso/entity_extraction/blob/main/PDF%20Entity%20Extraction%20with%20Regex%20and%20LLMs.ipynb

A Jupyter notebook for extracting data from PDFs. Useful stuff.

How the US CCA generate PDFs

A quick look at PDFs from the Circuit Courts reveals which software each court uses to generate the PDFs it releases.

  • 1st – Corel WordPerfect
  • 2nd – Microsoft Word
  • 3rd – Microsoft Word
  • 4th – Microsoft Word
  • 5th – Corel WordPerfect
  • 6th – ? not obvious from PDF metadata
  • 7th – Microsoft Word
  • 8th – Corel WordPerfect
  • 9th – Corel WordPerfect
  • 10th – Microsoft Word
  • 11th – Microsoft Word
  • DC – Adobe
  • Fed – Adobe

I obtained this information by downloading a recent PDF from each court and checking the file's properties for the creating application.
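
The same check can be roughed out in code. This is a minimal sketch, not a robust parser: it assumes the PDF stores its Info dictionary uncompressed, with Creator and Producer written as plain literal strings. Files that keep this data in compressed object streams, hex strings, or XMP only would need a proper PDF library. The function name and the sample bytes are hypothetical.

```python
import re

# Matches "/Creator (...)" or "/Producer (...)" literal strings in raw PDF bytes.
# Assumption: no nested or escaped parentheses inside the value.
INFO_FIELD = re.compile(rb"/(Creator|Producer)\s*\(([^()]*)\)")


def creation_info(pdf_bytes):
    """Return the Creator/Producer values found in raw PDF bytes."""
    info = {}
    for key, value in INFO_FIELD.findall(pdf_bytes):
        # Keep only the first occurrence of each key.
        info.setdefault(key.decode("ascii"), value.decode("latin-1"))
    return info


# A made-up fragment standing in for a downloaded opinion.
sample = b"1 0 obj\n<< /Creator (Microsoft Word) /Producer (Adobe PDF Library) >>\nendobj"
info = creation_info(sample)
```

For a real file, read it in binary mode and pass the bytes to `creation_info`. An empty result means only that this simple scan found nothing, which may be why a court's software is not obvious from the metadata.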

Please Make PDFs Go Away…

Most law firms have a history of using Adobe’s Portable Document Format (PDF) to distribute their brochures, papers and longer written pieces. That practice matches what web usability experts have long advised: “PDF is great for distributing documents that need to be printed,” but not much more than that. The well-traveled rule is that if a document contains more than five pages of text (hint: that excludes lawyer profiles), then PDF format is worth considering.

Now, let’s throw a wrench into this. As we approach the end of 2011, many firms and their clients are moving toward paperless offices. Clients are consuming law firm publications on a variety of devices, including smartphones, tablets, e-readers, and large multiple-monitor desktop environments. So how likely is it that we consume a PDF on printed paper? Not very.

Slaw – Revisiting PDFs for Law Firm Websites & Mobile Publishing

Finally someone has something useful to say about the future of PDFs. As someone who has to deal with found PDFs from all over the web, I can honestly say I wouldn’t miss them if they disappeared tomorrow. PDF is an excellent way to capture the artifact of the document page, but a PDF is not a web page, and PDF is not open data. PDF is a photocopy, a snapshot picture of a document. If you are interested in doing things like indexing data, repurposing data, reusing data, then a PDF is pretty useless.