Infrastructure · 2021

Romanian Text-Classification Training Pipeline (spaCy Projects)

An influencer-marketing media-intelligence platform

Overview

The model-training counterpart to the platform’s NLP inference service. It uses spaCy’s Projects workflow to train Romanian-language text-classification (textcat) models from labelled corpora, producing the packaged models that the serving API later loads.

Why It Exists

The product’s automated content categorization depends on trained classifiers. This repository holds the reproducible, config-driven pipeline that builds those models, keeping training assets and configuration versioned and rerunnable.

What We Built

A spaCy v3 Projects setup (project.yml) configured for Romanian (lang: "ro") text classification, with standard directories for assets, corpus, configs, training, scripts, and packages. Training and evaluation data are shipped as JSONL (train.jsonl / eval.jsonl), training itself is driven by a config.cfg, and CPU/GPU selection is exposed as a project variable. The repo includes a Dockerfile, Makefile, run script, .aws config, GitLab CI, and a test_project_textcat_demo.py smoke test. It builds on spaCy’s textcat_docs_issues tutorial pattern, adapted to the product’s labelling scheme.
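As a rough illustration of the JSONL step, the sketch below reads a corpus file and builds the per-label category dictionary that a textcat component trains on. The record fields ("text", "label"), the label names, and the function name read_textcat_jsonl are assumptions made for this example; the actual layout of train.jsonl / eval.jsonl is not described here.

```python
import json
from pathlib import Path


def read_textcat_jsonl(path, labels):
    """Read one JSON object per line and return (text, cats) pairs.

    Assumption: each record looks like {"text": "...", "label": "..."}.
    The real field names in train.jsonl / eval.jsonl may differ.
    """
    examples = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            if record["label"] not in labels:
                raise ValueError(
                    f"{path}:{line_no}: unknown label {record['label']!r}"
                )
            # Give every known label a score: 1.0 for the gold label, else 0.0.
            cats = {label: float(label == record["label"]) for label in labels}
            examples.append((record["text"], cats))
    return examples
```

In a spaCy pipeline, each pair would typically be turned into a Doc with spacy.blank("ro").make_doc(text), assigned to doc.cats, and collected in a DocBin that is saved to the corpus directory. Failing early on an unknown label keeps labelling mistakes out of the training data.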

Technologies & Approach

spaCy 3.0 Projects for reproducible, declarative ML pipelines; JSONL corpora; containerized builds via Docker; and GitLab CI for automation. The Projects format makes each training run repeatable: commands, assets, and parameters are declared in project.yml, and settings such as the CPU/GPU choice are passed in as variables rather than edited by hand.
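To make the parameterization concrete, here is a minimal sketch of how a set of project variables could be expanded into a spacy train invocation. The variable names (config, output_dir, train, dev, gpu_id) and the helper build_train_command are hypothetical; only the spacy train command and its --output, --paths.train, --paths.dev, and --gpu-id options come from spaCy itself.

```python
def build_train_command(variables):
    """Expand project-style variables into a `spacy train` argument list.

    Assumption: the variable names mirror what a project.yml `vars` block
    might define; the repository's real names are not known here.
    """
    # spaCy uses -1 to mean "train on CPU"; any other value is a GPU device id.
    gpu_id = str(variables.get("gpu_id", -1))
    return [
        "python", "-m", "spacy", "train", variables["config"],
        "--output", variables["output_dir"],
        "--paths.train", variables["train"],
        "--paths.dev", variables["dev"],
        "--gpu-id", gpu_id,
    ]


# Hypothetical values; in a spaCy project they would be defined in project.yml.
example_vars = {
    "config": "configs/config.cfg",
    "output_dir": "training/",
    "train": "corpus/train.spacy",
    "dev": "corpus/eval.spacy",
}
command = build_train_command(example_vars)
```

Because the device is just another variable, the same pipeline definition can run on a CPU-only CI runner or a GPU machine by changing one value.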

Outcome / Impact

Produced the trained Romanian text-classification models consumed by the NLP API service, giving the platform a maintainable, repeatable path from labelled data to a deployable classifier. A focused R&D/ML-engineering deliverable.

Capabilities Demonstrated

Building reproducible NLP training pipelines with spaCy Projects
Romanian-language text classification (non-English NLP)
Config- and data-driven model builds with versioned corpora
Containerized, CI-automated ML training feeding a separate serving service
