Infrastructure · 2025

Multi-Source Data Collection Platform on Google Cloud Run

A media-monitoring / data-orchestration platform

Overview

A cloud-native data collection platform that runs a fleet of source-specific collectors on Google Cloud Run, backed by managed PostgreSQL and Redis. It pulls content from a wide range of news, social, and media-intelligence sources behind a single configurable service, with Terraform-managed infrastructure and automated deployments.

The Challenge

Aggregating signal from many heterogeneous data sources, each with its own API, auth, and rate limits, usually sprawls into a mess of one-off scripts. The platform needed a uniform way to add sources, deploy them independently, respect per-source rate limits, and run reliably on serverless infrastructure without a standing fleet of VMs.

What We Built

A single TypeScript/Express codebase that boots into a specific collector via a SERVICE_TYPE environment variable, with dedicated Dockerfiles and run targets for each source: NewsCatcher, Podchaser, X/Twitter, Truth Social, LexisNexis, LinkedIn Search, Google, and a generic X service. The src tree is cleanly separated into handlers, services, db-service, models, a shared rate-limiter, dependency-injected wiring (tsyringe), and Sequelize migrations/seeders. Third-party scraping is brokered through Apify, structured logging runs on Pino, and the whole thing is provisioned by Terraform (main.tf, cloud-tasks.tf, environment configs) including a separate Pub/Sub stack for NewsCatcher ingestion. Deployment is handled by shell scripts and GitHub Actions workflows for Cloud Run.
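As a rough illustration of the SERVICE_TYPE boot pattern described above, the dispatch step might look something like the sketch below. The names `CollectorHandler`, `collectors`, and `resolveCollector`, along with the registry keys, are hypothetical and not taken from the actual codebase. The real service wires its handlers through tsyringe and Express, which are omitted here to keep the example dependency-free.

```typescript
// Hypothetical sketch: all names and registry keys here are illustrative assumptions.
type CollectorHandler = (payload: unknown) => Promise<string>;

// One entry per source. In a real service each handler would pull its
// dependencies (API client, rate limiter, db-service) from a DI container.
const collectors = new Map<string, CollectorHandler>([
  ["newscatcher", async () => "newscatcher: collected"],
  ["podchaser", async () => "podchaser: collected"],
  ["truth-social", async () => "truth-social: collected"],
]);

// Pick the collector for this container from its environment.
// Failing fast on a missing or unknown value surfaces misconfiguration at boot.
function resolveCollector(
  env: Record<string, string | undefined>
): CollectorHandler {
  const serviceType = env.SERVICE_TYPE;
  if (!serviceType) {
    throw new Error("SERVICE_TYPE is not set");
  }
  const handler = collectors.get(serviceType);
  if (!handler) {
    const known = Array.from(collectors.keys()).join(", ");
    throw new Error(`Unknown SERVICE_TYPE "${serviceType}" (known: ${known})`);
  }
  return handler;
}
```

An entrypoint would call `resolveCollector(process.env)` once at startup and mount the returned handler on its HTTP routes, so each Cloud Run service runs the same image layout with a different SERVICE_TYPE value.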

Technologies & Approach

TypeScript + Express on Cloud Run, Sequelize over PostgreSQL 17, Redis (ioredis) for rate-limiting and coordination, and Cloud Pub/Sub + Cloud Tasks for async/event-driven flows. Infrastructure is fully described in Terraform with bootstrap scripts and per-environment configuration; migrations run in CI pipelines rather than in containers, keeping production images lean.
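The case study does not say which algorithm the shared rate-limiter uses. One common way to enforce per-source limits over Redis is a fixed-window counter, sketched below under that assumption. `CounterStore`, `MemoryStore`, `Limit`, `windowKey`, and `tryAcquire` are hypothetical names. The store interface mirrors the `incr` and `pexpire` commands that ioredis exposes, and an in-memory stand-in is used so the sketch runs without a Redis server.

```typescript
// Hypothetical sketch of a fixed-window limiter; not the platform's actual implementation.

// Minimal subset of a Redis client. An ioredis instance satisfies this shape.
interface CounterStore {
  incr(key: string): Promise<number>;
  pexpire(key: string, ms: number): Promise<number>;
}

// In-memory stand-in for Redis, for local runs and tests only.
class MemoryStore implements CounterStore {
  private counts = new Map<string, number>();

  async incr(key: string): Promise<number> {
    const next = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, next);
    return next;
  }

  async pexpire(_key: string, _ms: number): Promise<number> {
    return 1; // expiry is a no-op in this stand-in
  }
}

interface Limit {
  maxRequests: number; // allowed calls per window
  windowMs: number;    // window length in milliseconds
}

// Keys are bucketed by window index, so every window starts counting from zero.
function windowKey(source: string, nowMs: number, windowMs: number): string {
  return `ratelimit:${source}:${Math.floor(nowMs / windowMs)}`;
}

// Returns true if the caller may hit the source's API in the current window.
async function tryAcquire(
  store: CounterStore,
  source: string,
  limit: Limit,
  nowMs: number = Date.now()
): Promise<boolean> {
  const key = windowKey(source, nowMs, limit.windowMs);
  const count = await store.incr(key);
  if (count === 1) {
    // First hit in this window: let the key expire once the window is over.
    await store.pexpire(key, limit.windowMs);
  }
  return count <= limit.maxRequests;
}
```

Because the counter lives in a shared store rather than in process memory, every Cloud Run instance of a collector draws from the same budget. Note that `incr` followed by `pexpire` is not atomic; a production limiter would typically wrap both in a Lua script or a transaction.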

Outcome / Impact

The platform turned a class of fragile per-source scrapers into a uniform, independently deployable collector fleet with shared persistence, rate-limiting, and observability. Adding a new source is now a matter of writing a new handler and Dockerfile rather than standing up new infrastructure, and the Terraform-first approach makes spinning up a fresh dev or prod environment a documented, repeatable process.

Capabilities Demonstrated

Designing serverless, source-pluggable data ingestion on Google Cloud Run
Infrastructure-as-code with Terraform across multiple GCP services (Cloud Run, Pub/Sub, Cloud Tasks, managed Redis)
Shared rate-limiting and persistence layers across many third-party APIs
CI/CD-first delivery with migration pipelines and per-environment config
