Bill-of-Costs OCR Extraction Pipeline
A major US home-improvement retailer
Overview
A document-OCR pipeline that extracts structured line-item data from a large, multi-page “Bill of Costs” PDF for a major US home-improvement retailer. It converts the PDF pages to images, crops a fixed vertical column from each page, and runs OCR on the crop to recover the text values.
Why It Exists
Long financial/legal cost documents are delivered as flat PDFs, making the embedded tabular data unusable downstream. This build validated an automated way to pull a specific column of values out of dozens of pages without manual transcription.
What We Built
A focused Python script (do.py) using PaddleOCR. It enumerates the page images, sorts them by the page index extracted from each file name, and crops a fixed window from each page (a 214-pixel-wide column starting 173px from the left edge, spanning the full page height) to isolate the relevant figures. Each crop is saved to disk for visual verification and then passed to PaddleOCR’s predict method, with the recognized text lines accumulated in page order into a single result set. Supporting folders hold the source PDF, the per-page images, the cropped regions, and the output.
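The ordering and cropping steps described above can be sketched as follows. This is a minimal illustration, not the contents of do.py: the file-naming pattern (digits immediately before the extension) and the helper names page_index, sort_pages and crop_box are assumptions made for the example. Only the 173px offset and 214px width come from the description above.

```python
import re

CROP_LEFT = 173   # px from the left edge (from the write-up)
CROP_WIDTH = 214  # px column width (from the write-up)

# Assumed naming scheme: the page number is the run of digits right
# before the file extension, e.g. "page_12.png".
_PAGE_RE = re.compile(r"(\d+)(?=\.[^.]+$)")


def page_index(name: str) -> int:
    """Return the numeric page index embedded in a file name."""
    match = _PAGE_RE.search(name)
    if match is None:
        raise ValueError(f"no page index found in {name!r}")
    return int(match.group(1))


def sort_pages(names: list[str]) -> list[str]:
    """Sort numerically, so page_2 comes before page_10."""
    return sorted(names, key=page_index)


def crop_box(page_height: int) -> tuple[int, int, int, int]:
    """Full-height column as a (left, upper, right, lower) box."""
    return (CROP_LEFT, 0, CROP_LEFT + CROP_WIDTH, page_height)
```

The numeric sort key matters because a plain string sort would place "page_10.png" before "page_2.png" and scramble the line-item order. The box uses the (left, upper, right, lower) convention that image libraries such as Pillow accept in Image.crop, so it could be passed straight to a crop call if Pillow were the imaging library in use.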
Technologies & Approach
PaddleOCR was chosen for its solid printed-text recognition and its switches for document-orientation classification and unwarping. Cropping each page to the column of interest before OCR removes unrelated page content from the input, which reduced noise and improved extraction accuracy compared with running OCR on the full page.
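A rough sketch of how the OCR stage might be configured and its output accumulated is shown below. It assumes the PaddleOCR 3.x interface, in which the constructor accepts use_doc_orientation_classify, use_doc_unwarping and use_textline_orientation, and each result returned by predict exposes a "rec_texts" list through dict-style access. The toggles are set to False purely for illustration, since the write-up does not say which settings were used, and build_ocr, collect_texts and run_ocr are hypothetical helper names.

```python
from typing import Any, Iterable, Mapping


def build_ocr() -> Any:
    """Create the OCR engine (assumes the PaddleOCR 3.x constructor)."""
    # Imported lazily so collect_texts works without PaddleOCR installed.
    from paddleocr import PaddleOCR

    return PaddleOCR(
        use_doc_orientation_classify=False,  # assumed setting
        use_doc_unwarping=False,             # assumed setting
        use_textline_orientation=False,      # assumed setting
    )


def collect_texts(page_results: Iterable[Mapping[str, Any]]) -> list[str]:
    """Flatten recognized lines from one predict() call, dropping blanks."""
    lines: list[str] = []
    for res in page_results:
        for text in res.get("rec_texts", []):
            cleaned = str(text).strip()
            if cleaned:
                lines.append(cleaned)
    return lines


def run_ocr(crop_paths: Iterable[str]) -> list[str]:
    """OCR each cropped column in order and accumulate the text lines."""
    ocr = build_ocr()
    all_lines: list[str] = []
    for path in crop_paths:
        all_lines.extend(collect_texts(ocr.predict(path)))
    return all_lines
```

Keeping collect_texts separate from the engine call means the accumulation logic can be checked against hand-built result dictionaries without loading any OCR models.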
Outcome / Impact
Demonstrated a repeatable, automated path from a locked multi-page cost PDF to structured, machine-readable line items, proving feasibility for the client’s document-digitization need.
Capabilities Demonstrated
- OCR-based extraction of tabular values from scanned/exported PDFs
- Deterministic region cropping to boost recognition accuracy
- Batch page-to-image processing and ordered aggregation in Python