Web Content Scraping Service with Transform-Based Caching
A media-monitoring / data-orchestration platform
Overview
A FastAPI web-content scraping service that fetches URLs (HTML pages and PDFs) via a browser-based scraper, runs configurable transforms over the content, and caches results in Redis for fast reuse.
The Challenge
A media-monitoring pipeline frequently needs the full text behind article URLs, including bot-protected pages and PDFs. Re-fetching the same content repeatedly is slow and costly, so the platform needs a cache-first fetch layer that can also normalize content through reusable transforms.
What We Built
A FastAPI service exposing two main endpoints plus a health endpoint. /warm enqueues URLs and pre-computes their transforms; /fetch does the same but also returns the transform outputs, and it auto-enqueues URLs it has not seen before. A browser-based scraper integration (Scrappey) handles difficult pages, PyMuPDF extracts text from PDFs, and a transform registry applies content transforms whose outputs are cached in Redis. The app is organized into scraper, transforms, worker, models, and Redis-client modules and ships with a Dockerfile and a Compose setup.
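The warm-versus-fetch behaviour described above can be sketched with standard-library stand-ins. This is a minimal illustration, not the service's actual code: the dict replaces the Redis cache, the deque replaces the worker queue, and the function names, response shapes, and the single "plain_text" transform are all assumptions made for the example.

```python
from collections import deque
from typing import Callable, Deque, Dict, Optional

# Hypothetical in-memory stand-ins: a dict for the Redis cache,
# a deque for the queue that the worker consumes.
cache: Dict[str, Dict[str, str]] = {}
queue: Deque[str] = deque()


def enqueue_if_unknown(url: str) -> bool:
    """Queue a URL for scraping unless it is already cached or queued."""
    if url in cache or url in queue:
        return False
    queue.append(url)
    return True


def warm(url: str) -> dict:
    """Sketch of /warm: schedule the work, return no content."""
    return {"url": url, "enqueued": enqueue_if_unknown(url)}


def fetch(url: str) -> dict:
    """Sketch of /fetch: return cached outputs, or enqueue and report pending."""
    outputs = cache.get(url)
    if outputs is None:
        enqueue_if_unknown(url)
        return {"url": url, "status": "pending", "transforms": {}}
    return {"url": url, "status": "ready", "transforms": outputs}


def run_worker_once(scrape: Callable[[str], str]) -> Optional[str]:
    """Process one queued URL: scrape it and cache one transform output."""
    if not queue:
        return None
    url = queue.popleft()
    text = scrape(url)
    # Assumed example transform: collapse runs of whitespace.
    cache[url] = {"plain_text": " ".join(text.split())}
    return url
```

In this sketch a first fetch for an unknown URL reports "pending" and queues it; once the worker has processed the URL, the same call returns the cached transform outputs without scraping again.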
Technologies & Approach
Python with FastAPI and Pydantic for typed request/response handling, Redis (with hiredis) for the content/transform cache, httpx for HTTP, and PyMuPDF for PDF parsing. A registry pattern keeps transforms pluggable; warm-vs-fetch semantics separate pre-computation from retrieval.
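A registry of this kind is often built as a dictionary populated by a decorator. The sketch below shows one possible shape under stated assumptions: the transform names, the `content:<sha256>:<transform>` key layout, and the `apply_transforms` helper are hypothetical, and a plain dict stands in for Redis.

```python
import hashlib
from typing import Callable, Dict, List

# Hypothetical registry: transform name -> function from raw text to output text.
TRANSFORMS: Dict[str, Callable[[str], str]] = {}


def register(name: str):
    """Decorator that adds a transform to the registry under `name`."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        TRANSFORMS[name] = func
        return func
    return decorator


@register("collapse_whitespace")
def collapse_whitespace(text: str) -> str:
    return " ".join(text.split())


@register("lowercase")
def lowercase(text: str) -> str:
    return text.lower()


def cache_key(url: str, transform: str) -> str:
    """Assumed key layout: one cache entry per (URL, transform) pair."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"content:{digest}:{transform}"


def apply_transforms(url: str, raw: str, names: List[str],
                     cache: Dict[str, str]) -> Dict[str, str]:
    """Return {name: output}, computing only transforms missing from `cache`."""
    results = {}
    for name in names:
        key = cache_key(url, name)
        if key not in cache:
            cache[key] = TRANSFORMS[name](raw)
        results[name] = cache[key]
    return results
```

Adding a new transform then only requires a new decorated function; callers select transforms by name, and repeated requests for the same URL and transform are served from the cache.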
Outcome / Impact
The service gives the pipeline a fast, deduplicated content-fetch layer, so downstream enrichment and matching operate on cleaned text without redundant scraping.
Capabilities Demonstrated
- Cache-first scraping architecture (warm vs. fetch)
- Browser-based scraping for protected pages
- Pluggable transform registry over fetched content
- PDF text extraction in the ingestion path