Boolean Query Matching Engine for Article Streams
A media-monitoring / data-orchestration platform
Overview
A high-performance query-matching engine that evaluates incoming articles against roughly 1,500 boolean queries, deployed on Google Cloud Run. Its output determines which subscriptions each article belongs to.
The Challenge
Matching every article against ~1,500 complex boolean queries, with phrase, proximity, fuzzy, and multilingual semantics, must be both accurate and fast enough to keep up with a continuous ingestion stream, while remaining explainable.
What We Built
A six-phase matching pipeline: (1) tokenization with position tracking, (2) candidate-query identification via an inverted index, (3) exact / normalized / stemmed matching, (4) a viability check using three-state AST evaluation, (5) fuzzy plus semantic validation, and (6) final boolean evaluation. The codebase includes a custom query parser, AST nodes, a fast stemmer and lightweight inflection handling, content cleaning (ftfy), snippet generation with keyword positions, and a cache builder for corpus/query caches. Language-specific test suites cover English, German, and Arabic matching, plus editorial bonuses and skip rules. Reported throughput is ~37 articles/second (~27 ms per article).
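The viability check in phase (4) can be illustrated with a minimal sketch of three-state (true / false / unknown) evaluation. The node classes, the `evaluate` and `is_viable` functions, and the example query are hypothetical and are not taken from the project's codebase; the sketch assumes that terms not yet resolved by the cheap matching phases are treated as unknown, so a query is discarded early only when it can no longer become true.

```python
from dataclasses import dataclass
from typing import Optional

# Three-state truth value: True (matched), False (ruled out), None (unknown).
Tri = Optional[bool]


@dataclass(frozen=True)
class Term:
    text: str


@dataclass(frozen=True)
class Not:
    child: object


@dataclass(frozen=True)
class And:
    children: tuple


@dataclass(frozen=True)
class Or:
    children: tuple


def evaluate(node, known: dict) -> Tri:
    """Kleene-style evaluation; terms missing from `known` count as unknown."""
    if isinstance(node, Term):
        return known.get(node.text)
    if isinstance(node, Not):
        value = evaluate(node.child, known)
        return None if value is None else not value
    values = [evaluate(child, known) for child in node.children]
    if isinstance(node, And):
        if any(v is False for v in values):
            return False  # one definite miss rules out the whole AND
        return None if any(v is None for v in values) else True
    if isinstance(node, Or):
        if any(v is True for v in values):
            return True  # one definite hit satisfies the whole OR
        return None if any(v is None for v in values) else False
    raise TypeError(f"unsupported node type: {type(node).__name__}")


def is_viable(node, known: dict) -> bool:
    """A query stays in play unless it is already definitely false."""
    return evaluate(node, known) is not False


# Hypothetical query: merger AND (bank OR insurer) AND NOT football
query = And((Term("merger"), Or((Term("bank"), Term("insurer"))), Not(Term("football"))))
```

With this sketch, `is_viable(query, {"merger": False})` is false, so the more expensive fuzzy and semantic phases could be skipped for that query, while `{"merger": True}` leaves the result unknown and the query stays viable.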
Technologies & Approach
Python with spaCy (en_core_web_md), NLTK, RapidFuzz for fuzzy matching, PyStemmer for stemming, ftfy for text repair, and NumPy. The inverted-index candidate step limits the heavy AST and semantic evaluation to queries that could plausibly match, and pre-built corpus/query caches keep per-article work low. The service is containerized for Cloud Run.
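As a rough illustration of the candidate step, the sketch below builds an inverted index from query terms to query IDs and looks up only the queries that share at least one term with an article. The function names, query IDs, and terms are hypothetical, and the sketch assumes every query contributes at least one positive, exactly matchable term to the index; purely negative or fuzzy-only queries would need separate handling.

```python
from collections import defaultdict


def build_index(query_terms: dict) -> dict:
    """Map each term to the set of query IDs that mention it."""
    index = defaultdict(set)
    for query_id, terms in query_terms.items():
        for term in terms:
            index[term].add(query_id)
    return dict(index)


def candidate_queries(index: dict, article_tokens: list) -> set:
    """Return IDs of queries sharing at least one term with the article."""
    candidates = set()
    for token in set(article_tokens):
        candidates |= index.get(token, set())
    return candidates


# Hypothetical queries, reduced to the positive terms they contain.
query_terms = {
    "q1": {"merger", "bank"},
    "q2": {"football", "league"},
    "q3": {"bank", "regulation"},
}
index = build_index(query_terms)
tokens = ["the", "bank", "announced", "a", "merger"]
candidates = candidate_queries(index, tokens)
```

Here only `q1` and `q3` become candidates, so full boolean evaluation would run on two queries instead of three; the same idea is what keeps the cost manageable across ~1,500 queries.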
Outcome / Impact
Serves as the platform’s core matching brain, replacing or augmenting the legacy production matcher. Its results were validated against the old system via a dedicated comparison dashboard before rollout.
Capabilities Demonstrated
- High-throughput boolean query matching at scale
- Multi-phase NLP pipeline (tokenize → index → match → AST → semantic)
- Multilingual matching (English, German, Arabic) with stemming and fuzzy logic
- Explainable matches via snippet and keyword-position generation
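Position tracking is what makes the snippets possible, and a minimal sketch of the idea follows. It assumes a simple regex tokenizer and character offsets; the project's actual tokenizer, snippet format, and function names are not described in this write-up, so `tokenize`, `keyword_positions`, and `snippet` are illustrative only.

```python
import re

TOKEN_RE = re.compile(r"\w+")


def tokenize(text: str) -> list:
    """Return (lowercased token, start offset, end offset) triples."""
    return [(m.group().lower(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def keyword_positions(text: str, keywords: list) -> list:
    """Return character spans of tokens that exactly match a keyword."""
    wanted = {k.lower() for k in keywords}
    return [(start, end) for token, start, end in tokenize(text) if token in wanted]


def snippet(text: str, span: tuple, radius: int = 30) -> str:
    """Cut a window around a span and bracket the matched keyword."""
    start, end = span
    left = max(0, start - radius)
    right = min(len(text), end + radius)
    return text[left:start] + "[" + text[start:end] + "]" + text[end:right]


text = "The bank announced a merger with a rival insurer."
spans = keyword_positions(text, ["merger"])
preview = snippet(text, spans[0], radius=5)
```

Because each match carries its own offsets, the same spans can drive both highlighted snippets and position-based checks such as proximity operators.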