High-Performance Boolean Query Matching Engine
A media-monitoring / data-orchestration platform
Overview
A production-grade engine that matches incoming articles against thousands of boolean search queries with NLP-aware optimizations. Documented throughput is roughly 37 articles per second with 1,502 queries evaluated per article (about 55,000 article-query pairs per second). The engine also pinpoints exact keyword positions within matched content.
The Challenge
Media monitoring means testing every incoming article against a large, evolving set of boolean queries (AND/OR/NOT phrase logic), and doing it fast enough to keep up with ingestion. Naive per-query evaluation is far too slow at this scale, so the matcher needs candidate filtering, caching, and careful tokenization to stay both accurate and performant.
What We Built
A Python matching system organized around four components: a query parser with an AST node model, a phrase-matching core, a cache builder that precomputes inverted-index and corpus caches, and a batch processor for large CSV inputs. It evaluates boolean queries against article text using spaCy/NLTK for tokenization and lemmatization and RapidFuzz for fuzzy comparison. A keyword-position finder reports where terms appear in content, feeding the platform’s hit_keyword_positions model.
Extensive tooling supports the engine: a benchmark harness, false-positive export, accuracy-parity analysis against the legacy system, corpus/query generators, and a GCP deployment package (Cloud Build config, Dockerfile, and deploy scripts, plus a separate keyword-service). Documentation covers a developer guide, an implementation summary, and the keyword-position feature.
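To make the AST node model concrete, here is a minimal sketch of how a boolean query tree might be evaluated against tokenized article text. It is illustrative only: the class names (Phrase, And, Or, Not), the phrase() helper, and the regex tokenizer are assumptions that stand in for the engine's actual parser and its spaCy/NLTK pipeline, and the tree is built by hand rather than parsed from a query string.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple, Union


def tokenize(text: str) -> List[str]:
    # Stand-in tokenizer: lowercase alphanumeric runs, no lemmatization.
    return re.findall(r"[a-z0-9]+", text.lower())


@dataclass(frozen=True)
class Phrase:
    words: Tuple[str, ...]

    def matches(self, tokens: List[str]) -> bool:
        # True if the words occur as a consecutive run in the token list.
        n = len(self.words)
        return any(
            tuple(tokens[i:i + n]) == self.words
            for i in range(len(tokens) - n + 1)
        )


@dataclass(frozen=True)
class And:
    children: Tuple["Node", ...]

    def matches(self, tokens: List[str]) -> bool:
        return all(child.matches(tokens) for child in self.children)


@dataclass(frozen=True)
class Or:
    children: Tuple["Node", ...]

    def matches(self, tokens: List[str]) -> bool:
        return any(child.matches(tokens) for child in self.children)


@dataclass(frozen=True)
class Not:
    child: "Node"

    def matches(self, tokens: List[str]) -> bool:
        return not self.child.matches(tokens)


Node = Union[Phrase, And, Or, Not]


def phrase(text: str) -> Phrase:
    # Hypothetical helper: build a Phrase node with the same tokenizer
    # that is applied to articles, so both sides are normalized alike.
    return Phrase(tuple(tokenize(text)))


# "electric vehicle" AND (tesla OR rivian) AND NOT recall
query = And((
    phrase("electric vehicle"),
    Or((phrase("tesla"), phrase("rivian"))),
    Not(phrase("recall")),
))

hit = query.matches(tokenize("Tesla unveils a new electric vehicle platform."))
miss = query.matches(tokenize("Tesla announces an electric vehicle recall."))
```

Tokenizing queries and articles with the same function keeps the comparison consistent; swapping in a lemmatizing tokenizer would only require changing tokenize().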
Technologies & Approach
Python with spaCy (en_core_web_sm), NLTK, RapidFuzz, and NumPy. The core optimization is candidate filtering via prebuilt inverted-index caches, produced by a dedicated cache-build step, so that only plausibly matching queries are fully evaluated for each article. Deployment targets Google Cloud via Cloud Build and Docker, and an accuracy-parity suite checks that the optimized engine reproduces the legacy system’s results.
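The candidate-filtering idea can be sketched as follows. This is a simplified illustration, not the engine's cache format: build_index and candidate_queries are hypothetical names, and the sketch assumes each query can be summarized by a set of terms of which at least one must appear for the query to match. Queries without such a term (for example, pure negations) would have to bypass the filter and always be evaluated.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_index(query_terms: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    # Map each term to the IDs of the queries that mention it.
    # In a real system this would be computed once and cached.
    index: Dict[str, Set[str]] = defaultdict(set)
    for query_id, terms in query_terms.items():
        for term in terms:
            index[term].add(query_id)
    return dict(index)


def candidate_queries(index: Dict[str, Set[str]],
                      article_tokens: Iterable[str]) -> Set[str]:
    # Union of the query IDs reachable from any distinct article token.
    found: Set[str] = set()
    for token in set(article_tokens):
        found |= index.get(token, set())
    return found


# Hypothetical query summaries: query ID -> terms that could trigger it.
query_terms = {
    "q1": {"tesla", "rivian"},
    "q2": {"football"},
    "q3": {"vehicle"},
}
index = build_index(query_terms)
candidates = candidate_queries(index, ["tesla", "unveils", "vehicle"])
```

Only the queries in candidates would then go through full boolean evaluation; here q2 is skipped because none of its terms occur in the article.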
Outcome / Impact
The engine provides the platform’s core matching capability at production throughput, with built-in benchmarking and accuracy-parity validation to verify that the optimized path stays correct as it scales.
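As an illustration of what an accuracy-parity check can look like, the sketch below compares the match sets produced by two systems. The function name parity_report, the (article_id, query_id) pair representation, and the agreement metric are assumptions chosen for the example, not the project's actual suite.

```python
from typing import Dict, List, Set, Tuple, Union

Match = Tuple[str, str]  # (article_id, query_id)


def parity_report(legacy: Set[Match],
                  optimized: Set[Match]) -> Dict[str, Union[List[Match], float]]:
    # Matches the legacy system found but the optimized engine missed.
    missing = legacy - optimized
    # Matches only the optimized engine reported (potential false positives).
    extra = optimized - legacy
    union = legacy | optimized
    # Share of all reported matches that both systems agree on.
    agreement = len(legacy & optimized) / len(union) if union else 1.0
    return {
        "missing": sorted(missing),
        "extra": sorted(extra),
        "agreement": agreement,
    }


# Made-up match sets, purely for demonstration.
legacy_matches = {("a1", "q1"), ("a1", "q2"), ("a2", "q1")}
optimized_matches = {("a1", "q1"), ("a2", "q1"), ("a2", "q3")}
report = parity_report(legacy_matches, optimized_matches)
```

Listing the missing and extra pairs separately makes it easy to export the extras for false-positive review while treating the missing pairs as regressions.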
Capabilities Demonstrated
- Building high-throughput text-matching engines (documented at ~37 articles/second, with 1,502 queries evaluated per article)
- Parsing and evaluating boolean queries via an AST with NLP tokenization
- Inverted-index candidate filtering and cache precomputation for performance
- Keyword-position extraction for explainable matches
- Accuracy-parity testing and benchmarking against a legacy system
- Packaging and deploying matching services on Google Cloud
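The keyword-position extraction listed above can be illustrated with a short sketch. It assumes whole-word, case-insensitive matching and reports character offsets; the function name find_keyword_positions and the output shape are hypothetical and are not the schema of the platform's hit_keyword_positions model.

```python
import re
from typing import Dict, Iterable, List, Tuple


def find_keyword_positions(text: str,
                           keywords: Iterable[str]) -> Dict[str, List[Tuple[int, int]]]:
    # For each keyword, collect (start, end) character offsets of every
    # whole-word, case-insensitive occurrence. Keywords with no hits are omitted.
    positions: Dict[str, List[Tuple[int, int]]] = {}
    for keyword in keywords:
        # re.escape keeps punctuation in the keyword literal; \b assumes the
        # keyword begins and ends with a word character.
        pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        spans = [(m.start(), m.end()) for m in pattern.finditer(text)]
        if spans:
            positions[keyword] = spans
    return positions


text = "Tesla recalls cars. TESLA responds."
positions = find_keyword_positions(text, ["tesla", "recall"])
# "recall" is absent from the result: "recalls" is not a whole-word match
# without lemmatization.
```

Returning offsets rather than matched strings lets a caller highlight the original text directly, for example with text[start:end].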