Romanian Text-Classification Training Pipeline (spaCy Projects)
An influencer-marketing media-intelligence platform
Overview
This repository is the model-training counterpart to the platform’s NLP inference service. It uses spaCy’s Projects workflow to train Romanian-language text-classification (textcat) models from labelled corpora and produces the packaged models that the serving API later loads.
Why It Exists
The product’s automated content categorization depends on trained classifiers. This repository holds the reproducible, config-driven pipeline that builds those models, keeping training assets and configuration versioned and rerunnable.
What We Built
A spaCy v3 Projects setup (project.yml) configured for Romanian (lang: "ro") text classification, with standard directories for assets, corpus, configs, training, scripts, and packages. Training and evaluation data ship as JSONL (train.jsonl / eval.jsonl); training itself is driven by config.cfg, with CPU/GPU selection exposed as a project variable. The repo also includes a Dockerfile, Makefile, run script, .aws config, GitLab CI, and a test_project_textcat_demo.py smoke test. It builds on spaCy’s textcat_docs_issues tutorial pattern, adapted to the product’s labelling scheme.
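As a rough illustration of the JSONL corpora described above, the following sketch reads records line by line and checks them before training. The record schema (a "text" string plus a "cats" mapping of label to 0/1 score) is an assumption modelled on the textcat_docs_issues tutorial, not something confirmed by this repository; the labels FASHION and TECH and the helper names iter_textcat_records and label_counts are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator, Tuple

Record = Tuple[str, Dict[str, float]]


def iter_textcat_records(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (text, cats) pairs from JSONL lines, skipping blank lines.

    Assumed schema (not confirmed by the repo): each line is a JSON object
    with a "text" string and a "cats" mapping of label -> 0/1 score.
    """
    for lineno, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue
        record = json.loads(raw)
        text = record.get("text")
        cats = record.get("cats")
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"line {lineno}: missing or empty 'text'")
        if not isinstance(cats, dict) or not cats:
            raise ValueError(f"line {lineno}: missing or empty 'cats'")
        yield text, {label: float(score) for label, score in cats.items()}


def label_counts(records: Iterable[Record]) -> Dict[str, int]:
    """Count how many examples are positive (score >= 0.5) for each label."""
    counts: Dict[str, int] = {}
    for _, cats in records:
        for label, score in cats.items():
            counts[label] = counts.get(label, 0) + (1 if score >= 0.5 else 0)
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines; real data would come from train.jsonl.
    sample = [
        '{"text": "Reducere la noua colectie", "cats": {"FASHION": 1, "TECH": 0}}',
        '{"text": "Recenzie telefon nou", "cats": {"FASHION": 0, "TECH": 1}}',
    ]
    print(label_counts(iter_textcat_records(sample)))
```

A check like this is cheap to run ahead of training and surfaces malformed lines or badly skewed labels early. In a spaCy v3 pipeline the validated pairs would typically be written to spaCy's binary DocBin format before training.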
Technologies & Approach
spaCy 3.0 Projects for reproducible, declarative ML pipelines; JSONL corpora; containerized builds via Docker; and GitLab CI for automation. The Projects format makes each training run repeatable and parameterized through project.yml variables.
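To make the parameterization concrete, here is a minimal sketch of how a CPU/GPU variable might be turned into a spaCy training command. The `spacy train` options used (`--output`, `--paths.train`, `--paths.dev`, `--gpu-id`) are part of spaCy v3's CLI, where a GPU id of -1 means CPU. The helper name build_train_command and the file paths are hypothetical and are not taken from this repository's project.yml.

```python
from typing import List


def build_train_command(
    config_path: str,
    output_dir: str,
    train_path: str,
    dev_path: str,
    gpu_id: int = -1,
) -> List[str]:
    """Build a `spacy train` invocation; gpu_id=-1 selects the CPU."""
    return [
        "python", "-m", "spacy", "train", config_path,
        "--output", output_dir,
        "--paths.train", train_path,
        "--paths.dev", dev_path,
        "--gpu-id", str(gpu_id),
    ]


if __name__ == "__main__":
    # Hypothetical paths; the real ones are defined in project.yml.
    cmd = build_train_command(
        "configs/config.cfg",
        "training",
        "corpus/train.spacy",
        "corpus/dev.spacy",
        gpu_id=-1,
    )
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. In the Projects workflow the same command would normally live in project.yml, with the GPU id supplied by a variable instead of a Python argument.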
Outcome / Impact
Produced the trained Romanian text-classification models consumed by the NLP API service, giving the platform a maintainable, repeatable path from labelled data to deployable classifier. It is a focused R&D/ML-engineering deliverable.
Capabilities Demonstrated
- Building reproducible NLP training pipelines with spaCy Projects
- Romanian-language text classification (non-English NLP)
- Config- and data-driven model builds with versioned corpora
- Containerized, CI-automated ML training feeding a separate serving service