Fuzzy Supplier-to-Company Entity Resolution
An SMB accounting/fintech platform
Overview
A Jupyter-based build that reconciles OCR-extracted invoice seller names against a canonical list of known suppliers using fuzzy name matching. It addresses the core data-quality problem in automated bookkeeping: the same vendor appears under dozens of spelling, casing and legal-suffix variants.
Why It Exists
Invoices arrive as scanned documents whose seller names are extracted by OCR and never match the clean supplier registry exactly: “MUSGRAVE” versus “Musgrave Ltd”, trailing whitespace, multi-line addresses bleeding into the name field. Without reliable matching, every document needs manual vendor assignment, which defeats the point of automation.
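The kinds of noise listed above can be illustrated with a small normalisation step applied before any fuzzy comparison. This is a minimal sketch using only the standard library, not code taken from the notebook: the function name normalise_name and the suffix list are assumptions made for illustration.

```python
import re

# Assumed, illustrative suffix list; not taken from the notebook.
LEGAL_SUFFIXES = ("ltd", "limited", "plc", "llc", "inc")


def normalise_name(raw: str) -> str:
    """Reduce an OCR-extracted seller name to a comparable key (hypothetical helper)."""
    # Keep only the first non-empty line, since address lines can bleed into the field.
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    first_line = lines[0] if lines else ""
    # Case-fold and replace punctuation with spaces.
    cleaned = re.sub(r"[^\w\s]", " ", first_line.casefold())
    tokens = cleaned.split()
    # Drop trailing legal suffixes such as "ltd".
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


print(normalise_name("MUSGRAVE"))        # musgrave
print(normalise_name("Musgrave Ltd  "))  # musgrave
```

Normalisation of this sort only removes the predictable variation; OCR misreads and abbreviations still need a similarity score, which is where fuzzy matching comes in.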
What We Built
A notebook (Name Matching.ipynb) that loads a 209-row canonical company list and a larger feed of predicted invoice data with pandas, then applies the name_matching library’s NameMatcher to link extracted seller names to supplier IDs. Working CSVs (209-companies.csv, data.csv) capture the real fields in play: supplier_id, name, OCR-predicted seller/buyer names, VAT codes and Nanonets confidence scores.
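The linking step can be sketched roughly as follows. To keep the example self-contained, it swaps in the standard library's difflib.SequenceMatcher as a stand-in for NameMatcher and uses plain dictionaries instead of pandas DataFrames, so it shows the shape of the lookup rather than the notebook's actual code. The supplier rows, the best_match helper and the 0.6 threshold are all made up for illustration.

```python
from difflib import SequenceMatcher

# Illustrative rows only; the real data lives in 209-companies.csv and data.csv.
suppliers = [
    {"supplier_id": 1, "name": "Musgrave Ltd"},
    {"supplier_id": 2, "name": "Example Wholesale Ltd"},
]


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] after trivial cleanup."""
    return SequenceMatcher(None, a.casefold().strip(), b.casefold().strip()).ratio()


def best_match(seller_name, suppliers, threshold=0.6):
    """Return (supplier_id, score) for the closest supplier.

    supplier_id is None when the best score falls below the (assumed) threshold.
    """
    if not suppliers:
        return (None, 0.0)
    scored = [(similarity(seller_name, s["name"]), s["supplier_id"]) for s in suppliers]
    score, supplier_id = max(scored)
    return (supplier_id if score >= threshold else None, score)


supplier_id, score = best_match("MUSGRAVE ", suppliers)
print(supplier_id, round(score, 2))  # 1 0.8
```

Returning the score alongside the ID keeps low-confidence matches visible, so they can be routed to manual review instead of being silently assigned.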
Technologies & Approach
Python with pandas for data wrangling and the name_matching package for distance-based string similarity and candidate ranking. The exploratory notebook format let us tune matching thresholds against actual platform data before considering productionisation.
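Threshold tuning of this kind usually means checking, for several candidate cut-offs, how many matches are accepted and how many of those are correct. The sketch below assumes a hand-labelled sample of top-ranked matches is available; the sweep_thresholds helper and the scores are invented for illustration and are not results from the platform data.

```python
def sweep_thresholds(scored_pairs, thresholds):
    """Evaluate candidate cut-offs against labelled matches.

    scored_pairs: list of (score, is_correct) for each top-ranked candidate.
    Returns a list of (threshold, precision, coverage) tuples, where coverage
    is the share of documents that would be matched automatically.
    """
    results = []
    total = len(scored_pairs)
    for t in thresholds:
        accepted = [ok for score, ok in scored_pairs if score >= t]
        precision = sum(accepted) / len(accepted) if accepted else 0.0
        coverage = len(accepted) / total if total else 0.0
        results.append((t, precision, coverage))
    return results


# Made-up scores and labels, for illustration only.
scored_pairs = [(0.95, True), (0.80, True), (0.62, False), (0.55, True), (0.30, False)]

for t, p, c in sweep_thresholds(scored_pairs, [0.5, 0.7, 0.9]):
    print(f"threshold={t:.1f} precision={p:.2f} coverage={c:.2f}")
```

A higher threshold trades coverage for precision: fewer invoices are matched automatically, but fewer of those matches are wrong.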
Outcome / Impact
Validated that automated fuzzy matching could collapse noisy OCR vendor names onto canonical supplier records, reducing manual reconciliation. As an R&D notebook it proved the approach and informed the supplier-matching logic carried into the platform’s n8n workflows.
Capabilities Demonstrated
- Entity resolution and record linkage on real, messy financial data
- Practical fuzzy-matching pipeline design in Python/pandas
- Rapid data-science prototyping to de-risk a production feature