Multi-Threaded Term-Mapping & Keyword Extraction Engine
A media-monitoring / data-orchestration platform
Overview
A Python library and Cloud Function that map articles to their matching query terms and extract keywords, engineered for high throughput through multi-threading and algorithmic optimization. The project's implementation summary documents a complete threading-and-optimization system of roughly 7,500 lines across source, tests, and docs.
The Challenge
Resolving which of many boolean queries an article satisfies, and extracting the relevant terms, is computationally heavy at ingestion scale. Doing it fast requires concurrency that avoids lock contention plus algorithms that reject non-matches cheaply, while staying accurate against production data.
What We Built
A term-mapping engine built in layers. A query parser and predicate extractor turn boolean query text into evaluable structures, and an NLTK-based keyword extractor (with a fallback path) pulls out the relevant terms. The optimization stack combines an inverted-index optimizer (documented ~24x speedup via candidate filtering) with a Bloom-filter optimizer (~10x faster rejection). A lock-free, thread-local caching layer and a multi-threaded query engine drive concurrent article processing, and an optimization-config module makes the stack configurable. The system ships as a Google Cloud Function (functions-framework/Flask) backed by PostgreSQL, with an article-terms data layer and SQL database optimizations. It is covered by an extensive validation suite, threading tests, production-data analysis, and end-to-end comparison tests with metrics. Deep documentation (final summary, logging guide, optimization summaries, production-test runbook) accompanies the code.
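To illustrate the candidate-filtering idea behind the inverted-index optimizer, here is a minimal sketch. It is not the project's code: the function names, the query ids, and the assumption that each query's positively required terms have already been extracted are all hypothetical.

```python
from collections import defaultdict


def build_inverted_index(terms_by_query):
    """Map each term to the set of query ids that mention it."""
    index = defaultdict(set)
    for query_id, terms in terms_by_query.items():
        for term in terms:
            index[term].add(query_id)
    return index


def candidate_queries(index, article_tokens):
    """Return ids of queries that share at least one term with the article."""
    candidates = set()
    for token in set(article_tokens):
        # .get avoids inserting empty entries into the defaultdict.
        candidates |= index.get(token, set())
    return candidates


# Hypothetical data: the positive terms of three boolean queries.
queries = {
    "q1": {"solar", "energy"},
    "q2": {"football"},
    "q3": {"energy", "policy"},
}
index = build_inverted_index(queries)
tokens = "new energy policy announced".split()
shortlist = candidate_queries(index, tokens)
print(sorted(shortlist))  # ['q1', 'q3']
```

Only the shortlisted queries would then go through full boolean evaluation; "q2" is never evaluated for this article. A filter of this shape only works for queries that need at least one positive term, so a query made purely of negations would have to bypass it.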
Technologies & Approach
The engine is written in Python, uses NLTK for language processing, runs on Cloud Functions via functions-framework, and persists to PostgreSQL through psycopg2. The performance strategy layers two complementary filters under a lock-free threading model: Bloom filters cheaply reject impossible matches, and an inverted index shortlists candidates, so the heavy evaluation runs only where it can succeed.
Outcome / Impact
Delivered a production-ready, heavily tested term-mapping component with documented speedups (about 24x from the inverted index and about 10x from Bloom-filter rejection) and validation against real production data, strengthening the platform’s article-to-query matching pipeline.
Capabilities Demonstrated
- Designing lock-free, thread-local concurrent processing in Python
- Algorithmic optimization with inverted indexes and Bloom filters (documented ~24x and ~10x gains, respectively)
- Boolean query predicate extraction and keyword extraction with fallbacks
- Throughput benchmarking and production-data validation suites
- Packaging optimized NLP workloads as Google Cloud Functions over PostgreSQL