pdf – <CONTENT /> v.6

June 21, 2023

Colarusso: Sample Notebook for Extracting Data from OCRed PDFs Using Regex and LLMs

One can use this notebook to build a pipeline to parse and extract data from OCRed PDF files. Warning: When using LLMs for entity extraction, be sure to perform extensive quality control. They are very susceptible to distracting language (latching on to text that sounds “kind of like” what you’re looking for) and missing language (making up content to fill any holes), and importantly, they do NOT provide any hints as to when they may be erring. You need to make sure random audits are part of your workflow! Below we’ve worked out a workflow using regular expressions and LLMs to parse data from zoning board orders, but the process is generalizable.

Collect a set of PDFs
Place OCRed PDFs into the data folder
Write regexes to pull out data
Write LLM prompts to pull out data
https://github.com/colarusso/entity_extraction/blob/main/PDF%20Entity%20Extraction%20with%20Regex%20and%20LLMs.ipynb

A Jupyter notebook to extract data from PDFs. Useful stuff
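
For a rough feel of the regex step and the random audits the warning calls for, here is a minimal sketch. It is not taken from the linked notebook: the field names, the patterns, the sample text, and the `pick_audit_sample` helper are all made up for illustration, and it works on text that has already been pulled out of the PDF.

```python
import random
import re

# Hypothetical patterns for a zoning board order; real documents need their own.
PATTERNS = {
    "case_number": re.compile(r"Case\s+No\.?\s*([A-Z]{2,4}-\d{4}-\d+)"),
    "decision_date": re.compile(r"Date\s+of\s+Decision:\s*([A-Z][a-z]+ \d{1,2}, \d{4})"),
    "outcome": re.compile(r"\b(GRANTED|DENIED)\b"),
}


def extract_fields(text):
    """Return the first match for each pattern, or None when nothing matches."""
    record = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        record[name] = match.group(1) if match else None
    return record


def pick_audit_sample(records, k, seed=None):
    """Pick up to k records at random for a human to check by hand."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


# Invented sample text standing in for the OCR output of one order.
sample_text = """BOARD OF ZONING APPEALS
Case No. BZA-2023-0142
Date of Decision: March 3, 2023
The petition for a variance is hereby GRANTED."""

record = extract_fields(sample_text)
print(record)
```

A field that comes back as `None` is a useful signal that the document needs a human look, which is exactly the kind of hint the LLM step does not give you.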

November 9, 2020

GitHub – themsaid/ibis: A PHP tool that helps you write eBooks in markdown and convert to PDF.

GitHub – themsaid/ibis: A PHP tool that helps you write eBooks in markdown and convert to PDF. https://github.com/themsaid/ibis

October 24, 2015

vgel.me – Getting a full PDF from a DRM-encumbered online textbook

vgel.me – Getting a full PDF from a DRM-encumbered online textbook – http://vgel.me/posts/cracking-online-textbook/

July 15, 2014

Upload – PDFy – Instant PDF Host

Why does PDFy exist? I got sick of documents getting locked up behind login walls of services like Scribd. PDFy exists to offer a place where anybody can instantly upload and share a PDF, much like Imgur does for images. PDFy is free, ad-free, and non-commercial.

via Upload – PDFy – Instant PDF Host.

If you’re interested in running your own, the code is on GitHub at https://github.com/joepie91/pdfy.

January 29, 2014

How the US CCA generate PDFs

A real quick look at PDFs from the Circuit Courts reveals how the PDFs they release are generated.

1st – Corel WordPerfect
2nd – Microsoft Word
3rd – Microsoft Word
4th – Microsoft Word
5th – Corel WordPerfect
6th – ? (not obvious from the PDF metadata)
7th – Microsoft Word
8th – Corel WordPerfect
9th – Corel WordPerfect
10th – Microsoft Word
11th – Microsoft Word
DC – Adobe
Fed – Adobe

Information was obtained by simply downloading a recent PDF and looking at the properties of the file for creation information.
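
The same check can be roughed out in code. This is a sketch only, not how the list above was produced: it assumes the file is unencrypted and stores its Creator and Producer entries as plain literal strings, which many PDFs do not (hex or UTF-16 strings, XMP-only metadata, or compressed object streams will all be missed). The sample bytes and the names in them are invented, and a proper PDF library would be more reliable.

```python
import re

# Matches "/Creator (...)" or "/Producer (...)" literal strings in the raw file.
INFO_FIELD = re.compile(rb"/(Creator|Producer)\s*\(((?:\\.|[^\\()])*)\)")


def creation_info(pdf_bytes):
    """Return the first Creator/Producer literal strings found in the raw bytes."""
    found = {}
    for key, value in INFO_FIELD.findall(pdf_bytes):
        # Keep only the first occurrence of each key.
        found.setdefault(key.decode("ascii"), value.decode("latin-1"))
    return found


# Invented fragment standing in for a downloaded opinion.
fake_pdf = (
    b"%PDF-1.4\n1 0 obj\n"
    b"<< /Creator (Example Word Processor) /Producer (Example PDF Library) >>\n"
    b"endobj\n"
)

info = creation_info(fake_pdf)
print(info)
```

For a real file you would pass in the bytes read from disk, for example `creation_info(open("opinion.pdf", "rb").read())`, where `opinion.pdf` is a placeholder name.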

November 16, 2011

Please Make PDFs Go Away…

Most law firms have a history of using Adobe’s Portable Document Format (PDF) to distribute their brochures, papers and longer written pieces. That practice matches what web usability experts have long advised: “PDF is great for distributing documents that need to be printed,” but not much more than that. The well-traveled rule is that if a document contains more than five pages of text (hint: that excludes lawyer profiles), then PDF format is worth considering.

Now, let’s throw a wrench into this. As we approach the end of 2011, many firms and their clients are moving toward paperless offices. Clients are consuming law firm publications on a variety of devices, including smartphones, tablets, e-readers, and large multiple-monitor desktop environments. So how likely is it that we consume a PDF on printed paper? Not very.

Slaw – Revisiting PDFs for Law Firm Websites & Mobile Publishing

Finally someone has something useful to say about the future of PDFs. As someone who has to deal with found PDFs from all over the web, I can honestly say I wouldn’t miss them if they disappeared tomorrow. PDF is an excellent way to capture the artifact of the document page, but a PDF is not a web page, and PDF is not open data. PDF is a photocopy, a snapshot picture of a document. If you are interested in doing things like indexing data, repurposing data, reusing data, then a PDF is pretty useless.
