Product · 2023

VAT-Number Extraction & Validation Cache

An SMB accounting/fintech platform

Overview

A data-processing notebook that derives clean, validated VAT numbers from the platform’s OCR-predicted document data and persists them to a local SQLite cache. It focuses on Irish (IE-prefixed) VAT registrations, normalising and deduplicating seller VAT numbers per business.

Why It Exists

OCR extraction produces noisy, inconsistent VAT numbers: stray spaces, partial strings, non-Irish formats. To assign correct VAT codes and pre-fill returns, the platform needs a trustworthy mapping from business to validated VAT number, built and cached from historical document data.

What We Built

A Jupyter notebook (vat_numbers.ipynb) that loads a wide export (latest.csv) with pandas and filters it to records updated after a cut-off date. It keeps only rows with a non-null business and a plausible predicted_seller_vat_number (IE prefix, more than 7 characters), strips whitespace, and produces a clean business-to-VAT mapping. Results are stored in vat_cache.sqlite for fast lookup by downstream services.
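The filtering steps can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names `business` and `updated_at`, the function name `clean_vat_numbers`, the output column `vat_number`, and the sample rows (including the VAT numbers) are assumptions made up for the example. Only `predicted_seller_vat_number` and the rules themselves come from the description above. The sketch strips whitespace before applying the length check, which is one possible ordering.

```python
import io

import pandas as pd


def clean_vat_numbers(df: pd.DataFrame, cutoff: str) -> pd.DataFrame:
    """Return a deduplicated business-to-VAT mapping (illustrative only)."""
    df = df.copy()
    # `updated_at` is an assumed column name for the record's update time.
    df["updated_at"] = pd.to_datetime(df["updated_at"], errors="coerce")
    df = df[df["updated_at"] > pd.Timestamp(cutoff)]

    # Keep rows that have both a business and a predicted VAT number.
    df = df.dropna(subset=["business", "predicted_seller_vat_number"])

    # Remove all whitespace, including spaces inside the number.
    vat = (
        df["predicted_seller_vat_number"]
        .astype(str)
        .str.replace(r"\s+", "", regex=True)
    )
    df = df.assign(vat_number=vat)

    # Plausibility rules from the text: IE prefix, more than 7 characters.
    plausible = df["vat_number"].str.startswith("IE") & (
        df["vat_number"].str.len() > 7
    )
    df = df[plausible]

    return df[["business", "vat_number"]].drop_duplicates().reset_index(drop=True)


# Made-up sample standing in for latest.csv.
sample = io.StringIO(
    "business,predicted_seller_vat_number,updated_at\n"
    "b1,IE 1234567 X,2023-03-01\n"
    "b1,IE1234567X,2023-04-01\n"
    "b2,GB123456789,2023-03-05\n"
    "b3,IE12,2023-03-06\n"
    ",IE7654321Y,2023-03-07\n"
    "b4,IE7654321Y,2022-01-01\n"
)
mapping = clean_vat_numbers(pd.read_csv(sample), cutoff="2023-01-01")
print(mapping)
```

In this sample only business b1 survives: the two spellings of its VAT number collapse to one row once whitespace is removed, while the non-Irish, too-short, business-less and out-of-date rows are dropped.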

Technologies & Approach

Python with pandas and NumPy for filtering, type-coercion and string normalisation; SQLite as a lightweight persistent cache. The notebook format suited iterative tuning of the validation rules against real platform data.
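A cache of this kind might look like the sketch below, using Python's standard sqlite3 module. The table name `vat_cache`, its columns, and the helper functions `save_mapping` and `lookup` are hypothetical; the source only states that results were stored in vat_cache.sqlite. The example uses an in-memory database so it runs without touching disk.

```python
import sqlite3


def save_mapping(conn: sqlite3.Connection, rows) -> None:
    """Persist (business, vat_number) pairs, ignoring duplicates."""
    # Hypothetical schema: the composite primary key deduplicates pairs.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS vat_cache ("
        " business TEXT NOT NULL,"
        " vat_number TEXT NOT NULL,"
        " PRIMARY KEY (business, vat_number))"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO vat_cache (business, vat_number) VALUES (?, ?)",
        rows,
    )
    conn.commit()


def lookup(conn: sqlite3.Connection, business: str) -> list:
    """Return the cached VAT numbers for one business."""
    cur = conn.execute(
        "SELECT vat_number FROM vat_cache WHERE business = ? ORDER BY vat_number",
        (business,),
    )
    return [row[0] for row in cur.fetchall()]


# In-memory database for illustration; a file path such as
# "vat_cache.sqlite" would make the cache persistent.
conn = sqlite3.connect(":memory:")
save_mapping(
    conn,
    [("b1", "IE1234567X"), ("b1", "IE1234567X"), ("b2", "IE7654321Y")],
)
print(lookup(conn, "b1"))
```

Keying the table on the (business, vat_number) pair means re-running the notebook against a newer export adds new pairs without duplicating existing ones.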

Outcome / Impact

Produced a reusable, validated VAT-number cache that improved the accuracy of VAT coding and reduced reliance on raw OCR output. It also proved out the cleansing rules later applied in the production document and tax pipelines.

Capabilities Demonstrated

  • Cleansing and validating noisy OCR-derived financial identifiers
  • Country-specific tax-number normalisation (Irish VAT)
  • Building lightweight persistent caches to serve downstream automation