Capability · 2025–26

Boolean Query Matching Engine for Article Streams

A media-monitoring / data-orchestration platform

Overview

A high-performance query-matching engine that evaluates incoming articles against roughly 1,500 boolean queries, deployed on Google Cloud Run. It determines which subscriptions a given article belongs to.

The Challenge

Matching every article against ~1,500 complex boolean queries, with phrase, proximity, fuzzy, and multilingual semantics, must be both accurate and fast enough to keep up with a continuous ingestion stream, while remaining explainable.
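To make the phrase and proximity requirements concrete, here is a minimal sketch of how both can be answered from token positions. It assumes a simple regex tokenizer and uses hypothetical helper names (tokenize_with_positions, phrase_match, within_distance); the production tokenizer and operator set are not described here, so treat this as an illustration rather than the engine's actual code.

```python
import re
from collections import defaultdict


def tokenize_with_positions(text):
    """Map each lowercased token to the list of token indexes where it occurs."""
    positions = defaultdict(list)
    for index, match in enumerate(re.finditer(r"\w+", text.lower())):
        positions[match.group()].append(index)
    return positions


def phrase_match(positions, words):
    """True if `words` occur consecutively and in order."""
    if not words:
        return False
    return any(
        all(start + offset in positions.get(word, [])
            for offset, word in enumerate(words))
        for start in positions.get(words[0], [])
    )


def within_distance(positions, left, right, max_gap):
    """True if `left` and `right` occur at most `max_gap` tokens apart."""
    return any(
        abs(i - j) <= max_gap
        for i in positions.get(left, [])
        for j in positions.get(right, [])
    )


positions = tokenize_with_positions("The central bank raised interest rates sharply")
print(phrase_match(positions, ["interest", "rates"]))
print(within_distance(positions, "bank", "rates", max_gap=3))
```

Recording positions once at tokenization time means phrase and proximity operators become cheap list lookups instead of repeated scans of the article text.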

What We Built

A six-phase matching pipeline: (1) tokenization with position tracking, (2) candidate-query identification via an inverted index, (3) exact / normalized / stemmed matching, (4) a viability check using three-state AST evaluation, (5) fuzzy plus semantic validation, and (6) final boolean evaluation. The codebase includes a custom query parser, AST nodes, a fast stemmer and lightweight inflection handling, content cleaning (ftfy), snippet generation with keyword positions, and a cache builder for corpus/query caches. Language-specific test suites cover English, German, and Arabic matching, plus editorial bonuses and skip rules. Reported throughput is ~37 articles/second (~27ms per article).
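The viability check in phase 4 can be illustrated with Kleene-style three-valued logic. The sketch below is one plausible reading, not the engine's actual AST: the node classes (Term, Not, And, Or), the Tri enum, and is_viable are hypothetical names. The assumption is that after the cheap matching phases each term is TRUE (matched), FALSE (ruled out), or UNKNOWN (only fuzzy or semantic validation could decide); a query that already evaluates to FALSE can skip the expensive phase.

```python
from dataclasses import dataclass
from enum import Enum


class Tri(Enum):
    FALSE = 0
    UNKNOWN = 1
    TRUE = 2


@dataclass(frozen=True)
class Term:
    word: str


@dataclass(frozen=True)
class Not:
    child: object


@dataclass(frozen=True)
class And:
    children: tuple


@dataclass(frozen=True)
class Or:
    children: tuple


def evaluate(node, states):
    """Evaluate a query AST; terms missing from `states` count as FALSE."""
    if isinstance(node, Term):
        return states.get(node.word, Tri.FALSE)
    if isinstance(node, Not):
        inner = evaluate(node.child, states)
        if inner is Tri.UNKNOWN:
            return Tri.UNKNOWN
        return Tri.FALSE if inner is Tri.TRUE else Tri.TRUE
    values = [evaluate(child, states) for child in node.children]
    if isinstance(node, And):
        # AND is the weakest child: any FALSE wins, then any UNKNOWN.
        return min(values, key=lambda v: v.value)
    if isinstance(node, Or):
        # OR is the strongest child: any TRUE wins, then any UNKNOWN.
        return max(values, key=lambda v: v.value)
    raise TypeError(f"unsupported node: {node!r}")


def is_viable(node, states):
    """A query stays in play unless it is already definitely FALSE."""
    return evaluate(node, states) is not Tri.FALSE


query = And((Term("merger"), Or((Term("acquisition"), Term("takeover")))))
print(is_viable(query, {"merger": Tri.TRUE, "takeover": Tri.UNKNOWN}))
```

Under this reading, only queries that come back UNKNOWN need fuzzy or semantic validation before the final boolean evaluation; TRUE and FALSE results are already settled.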

Technologies & Approach

Python with spaCy (en_core_web_md), NLTK, RapidFuzz for fuzzy matching, PyStemmer for stemming, ftfy for text repair, and NumPy. The inverted-index candidate step narrows each article to a short list of candidate queries, so the heavy AST/semantic evaluation runs only where a match is plausible, and the pre-built corpus/query caches keep per-article work low. The service is containerized for Cloud Run.
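The candidate step can be sketched with a plain dictionary-based inverted index. This is a simplified illustration using only the standard library; the function names (build_inverted_index, candidate_queries) and the sample query ids are hypothetical, and the real index presumably also covers normalized and stemmed forms.

```python
from collections import defaultdict


def build_inverted_index(query_terms):
    """Map each term to the set of query ids whose text mentions it."""
    index = defaultdict(set)
    for query_id, terms in query_terms.items():
        for term in terms:
            index[term].add(query_id)
    return index


def candidate_queries(index, article_tokens):
    """Return ids of queries sharing at least one term with the article."""
    candidates = set()
    for token in set(article_tokens):
        candidates |= index.get(token, set())
    return candidates


# Hypothetical queries, reduced to the positive terms they contain.
query_terms = {
    "q1": {"merger", "acquisition"},
    "q2": {"football"},
    "q3": {"bank", "merger"},
}
index = build_inverted_index(query_terms)
print(sorted(candidate_queries(index, "the bank announced a merger".split())))
```

The index is built once for all queries, so the per-article cost scales with the number of distinct tokens rather than with the number of queries. A query made up only of negated terms would never be surfaced by this lookup, so an engine built this way would need to treat such queries as always-candidate; how the production index handles that case is not stated here.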

Outcome / Impact

Serves as the platform’s core matching brain, replacing/augmenting the legacy production matcher. Its results were validated against the old system via a dedicated comparison dashboard before rollout.

Capabilities Demonstrated

High-throughput boolean query matching at scale
Multi-phase NLP pipeline (tokenize → index → match → AST → semantic)
Multilingual matching (English, German, Arabic) with stemming and fuzzy logic
Explainable matches via snippet and keyword-position generation
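As an illustration of the explainability point, the sketch below builds a snippet around the first keyword hit and reports where each keyword sits inside it. It is a minimal example using only the standard library re module; build_snippet, its context parameter, and the sample sentence are hypothetical, and the production snippet generator may work on token positions rather than character offsets.

```python
import re


def build_snippet(text, keywords, context=30):
    """Return (snippet, spans) around the first keyword hit, or None.

    Each span is (start, end, matched_text) with offsets relative to the snippet.
    """
    if not keywords:
        return None
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    # Match against the full text so word boundaries are judged in context.
    hits = list(pattern.finditer(text))
    if not hits:
        return None
    start = max(0, hits[0].start() - context)
    end = min(len(text), hits[0].end() + context)
    spans = [
        (m.start() - start, m.end() - start, m.group())
        for m in hits
        if m.start() >= start and m.end() <= end
    ]
    return text[start:end], spans


text = "Shares rose after the merger was approved by the regulator on Monday."
snippet, spans = build_snippet(text, ["merger", "regulator"], context=40)
for span_start, span_end, word in spans:
    print(word, snippet[span_start:span_end])
```

Returning offsets alongside the snippet lets a UI highlight exactly the words that triggered a match, which is what makes a boolean match auditable by an editor.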
