PDF Table-to-CSV Extraction Spike
Overview
A minimal Python spike that extracts tabular data from a PDF and writes it out as CSV using the tabula-py library. It is a focused proof of a single capability: turning PDF tables into structured rows.
Why It Exists
PDFs are a common but awkward source for tabular data. This spike validated tabula-py as a quick path for pulling tables out of PDF documents into a machine-readable CSV, ahead of building larger document-processing flows.
What We Built
A single do.py script that calls tabula.convert_into("test.pdf", "output.csv", output_format="csv"), checked in alongside a sample input PDF and the CSV it produced. Deliberately small: one file, one job.
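For illustration, the one-line call could be wrapped roughly as follows. This is a minimal sketch, not the checked-in do.py: the function name pdf_tables_to_csv and the pages="all" default are assumptions added here (tabula-py processes only the first page unless told otherwise), and running the conversion requires tabula-py plus a Java runtime.

```python
from pathlib import Path


def pdf_tables_to_csv(pdf_path="test.pdf", csv_path="output.csv", pages="all"):
    """Convert tables found in a PDF into a single CSV file.

    Hypothetical wrapper around the spike's single call; the name and the
    pages="all" default are illustrative assumptions, not the original code.
    """
    # Fail early with a clear error before touching the Java-backed engine.
    if not Path(pdf_path).is_file():
        raise FileNotFoundError(f"Input PDF not found: {pdf_path}")

    # Imported here so the module can be loaded without tabula-py installed.
    import tabula

    # Same call as the spike, with the page range made explicit.
    tabula.convert_into(pdf_path, csv_path, output_format="csv", pages=pages)
    return Path(csv_path)
```

Calling pdf_tables_to_csv() with no arguments mirrors the spike's hard-coded file names; passing other paths would let the same function be reused on different documents.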
Technologies & Approach
Python with tabula-py (a wrapper over the Tabula Java engine, so a Java runtime is required) for table detection and extraction. It was chosen as the fastest way to evaluate PDF table extraction without writing custom parsing logic.
Outcome / Impact
Showed that tabular PDF content can be converted to CSV with minimal code, at least for the sample document tested, informing later, more substantial document-processing work in the studio’s Documents & PDF capability.
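A conversion like this can be spot-checked with a few lines of standard-library Python. The sketch below is an assumption about how such a check might look, not part of the spike: summarize_csv is a hypothetical helper, and UTF-8 output encoding is assumed. It counts the rows and tallies how many columns each row has, so ragged extractions stand out.

```python
import csv
from collections import Counter


def summarize_csv(csv_path):
    """Return (row_count, Counter mapping column count -> number of rows).

    Hypothetical helper for eyeballing an extracted CSV; assumes UTF-8.
    """
    widths = Counter()
    row_count = 0
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            row_count += 1
            widths[len(row)] += 1
    return row_count, widths
```

A single column count across all rows suggests a cleanly extracted table. Several different counts may point to misdetected columns, or may simply mean that tables of different shapes ended up in the same file.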
Capabilities Demonstrated
- Extracting tables from PDF documents into structured CSV
- Rapid library evaluation / technical de-risking