CAEN Business-Registry Code Dataset & Normalization Utility
A leading Romanian retail bank
Why It Exists
Onboarding and advising business and SME customers requires classifying each one by its code from CAEN, Romania’s official business-activity nomenclature. The bank’s lending and advisory tools need a clean, lookup-ready CAEN dataset rather than the raw, inconsistently formatted source list.
What We Built
A small data-preparation utility. process.py reads the source dataset (Coduri_CAEN.json, ~96 KB), iterates over its entries, strips noise such as the literal "Cod CAEN " prefix, and emits a normalized { titlu, cod } structure as filtered_coduri.json (~51 KB). The result is a compact title/code lookup ready to drop into product flows for business-activity selection and classification.
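The normalization step described above might look roughly like the sketch below. It is not the actual process.py: the source schema is not documented here, so the sketch assumes Coduri_CAEN.json is a JSON array of objects that already carry "titlu" and "cod" keys, with the prefix sitting in "cod". The function names (normalize_entry, normalize_entries, regenerate) and the choice to drop entries left without a code are hypothetical.

```python
import json
from pathlib import Path

PREFIX = "Cod CAEN "  # literal noise prefix named in the write-up


def normalize_entry(entry: dict) -> dict:
    """Return a {titlu, cod} record with the prefix and stray whitespace removed.

    Assumption: the raw entry exposes "titlu" and "cod" keys.
    """
    cod = str(entry.get("cod", "")).lstrip()
    if cod.startswith(PREFIX):
        cod = cod[len(PREFIX):]
    return {"titlu": str(entry.get("titlu", "")).strip(), "cod": cod.strip()}


def normalize_entries(entries: list) -> list:
    """Normalize every entry; drop those left without a code (an assumption)."""
    cleaned = (normalize_entry(e) for e in entries)
    return [e for e in cleaned if e["cod"]]


def regenerate(src: str = "Coduri_CAEN.json",
               dst: str = "filtered_coduri.json") -> None:
    """Read the raw list, normalize it, and write the compact lookup file."""
    entries = json.loads(Path(src).read_text(encoding="utf-8"))
    out = json.dumps(normalize_entries(entries), ensure_ascii=False, indent=2)
    Path(dst).write_text(out, encoding="utf-8")
```

Calling regenerate() from the directory that holds the source file would rebuild the output. Passing ensure_ascii=False keeps Romanian diacritics readable in the emitted JSON instead of escaping them.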
Technologies & Approach
Plain Python with the standard-library json module, deliberately minimal. The value is the cleaned, structured reference dataset and a repeatable script to regenerate it, feeding business-registry lookups in the SME credit and financial-advisory experiences.
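How the downstream tools consume the dataset is not described here. As one hedged illustration, a consumer could index the { titlu, cod } records by code and offer a simple title search for activity selection. The helper names build_lookup and search_by_title are hypothetical, and the sketch assumes the records have already been loaded from filtered_coduri.json as a list of dicts.

```python
def build_lookup(records: list) -> dict:
    """Index normalized records by code for direct code-to-title lookup."""
    return {r["cod"]: r["titlu"] for r in records}


def search_by_title(records: list, query: str) -> list:
    """Return records whose title contains the query, ignoring case."""
    q = query.casefold()
    return [r for r in records if q in r["titlu"].casefold()]
```

A dictionary keyed by code suits validation of a code the customer already knows, while the substring search suits a picker where the customer types part of the activity name.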
Outcome / Impact
Turned a messy public CAEN list into a clean, ready-to-use code/title dataset that supports business-activity classification in the bank’s SME-facing lending and advisory tools.
Capabilities Demonstrated
- Normalizing public reference data into product-ready lookups
- Business-registry (CAEN) classification support for SME banking
- Lightweight, repeatable ETL with no third-party dependencies