← All work
Tooling · 2025

Ingestion Quality Comparison & Validation Dashboard

A media-monitoring / data-orchestration platform

Overview

A validation tool that compares article-to-query matching results between two Cloud SQL databases ingesting the same NewsCatcher source, the original denormalized system and a new normalized ingestion system, to quantify differences in match count and quality before cutover.

Why It Exists

When replacing a production ingestion system, you need objective proof that the new pipeline matches (or beats) the old one. The two systems use different schemas: the original stores a denormalized articles table with a tsvector full-text column and a 1024-dimension vector for semantic search plus a flat article_terms match table, while the new system uses a normalized hits / queries / hit_queries many-to-many model over a query subset. A purpose-built comparison harness was needed to reconcile and grade these results.

What We Built

A Python comparison engine with a Streamlit dashboard. It extracts common queries across both databases (extract_common_queries.py), runs matching comparisons via shared DB utilities, and visualizes differences in both quantity and quality of matches using Pandas and Plotly. Schema definitions for both systems are captured so the tool can map denormalized and normalized representations onto a common basis for apples-to-apples analysis, with a one-command dashboard launcher.

Technologies & Approach

Python with psycopg2 against two Cloud SQL PostgreSQL instances, Pandas for reconciliation, Plotly for visualization, and Streamlit for the interactive UI. The design centers on isolating a shared query set so match differences reflect real ingestion behavior rather than input mismatch.

Outcome / Impact

Gave the team an evidence-based, interactive way to validate the new normalized ingestion system against the incumbent, de-risking the migration by surfacing exactly where and how matching diverged.

Capabilities Demonstrated

  • Building data-validation tooling for system migrations
  • Reconciling denormalized vs. normalized PostgreSQL schemas
  • Quantifying match parity across full-text and semantic search systems
  • Rapid interactive analytics dashboards with Streamlit and Plotly
More work See all →