Engineering · 2024–25

LLM-Ready Web Crawling & Scraping (Integrated/Extended)

Overview

Firecrawl is an open-source service (by mendableai) that turns entire websites into clean, LLM-ready markdown or structured data. In our work it was integrated, self-hosted and extended as the crawl-and-extract backbone for AI/data pipelines, not authored from scratch.

Why It Exists

Feeding web content to LLMs and RAG systems requires more than raw HTML: it needs crawling, JavaScript rendering, cleaning and conversion to markdown or structured fields. Firecrawl provides that as a service; we adopted and integrated it rather than rebuilding the capability.

What We Built

We worked with the full Firecrawl monorepo (the apps/ services, docker-compose.yaml for self-hosting, the supabase/ data layer and the documented SELF_HOST.md path) to stand up and integrate a self-managed crawl-for-AI pipeline. The role was architecting around and extending an OSS platform: deploying it, wiring its crawl/scrape/extract endpoints into downstream workflows, and using its SDK and examples/ as integration references.
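As an illustration of what wiring a scrape endpoint into a downstream workflow can look like, here is a minimal Python sketch using only the standard library. It is not the integration code from this project. The base URL (a local instance on port 3002), the /v1/scrape path, the request fields and the response shape are assumptions to check against the deployed Firecrawl version; build_scrape_request, to_document and scrape are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: a self-hosted instance reachable locally; adjust to your deployment.
BASE_URL = "http://localhost:3002"


def build_scrape_request(url, api_key=None, base_url=BASE_URL):
    """Build (but do not send) a POST request asking for markdown output."""
    # Assumed request body: the page URL plus the desired output formats.
    payload = {"url": url, "formats": ["markdown"]}
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed when authentication is enabled on the instance.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/v1/scrape",  # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def to_document(response_body):
    """Flatten an assumed response shape into a record for a downstream index."""
    if not response_body.get("success"):
        error = response_body.get("error", "unknown error")
        raise ValueError(f"scrape failed: {error}")
    data = response_body.get("data", {})
    metadata = data.get("metadata", {})
    return {
        "source_url": metadata.get("sourceURL"),
        "title": metadata.get("title"),
        "markdown": data.get("markdown", ""),
    }


def scrape(url, api_key=None, timeout=60):
    """Send the request and return a document record (needs a running instance)."""
    request = build_scrape_request(url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return to_document(json.load(response))
```

Keeping request construction and response parsing separate from the network call lets a pipeline unit-test both halves without a running crawler; only scrape() touches the service.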

Technologies & Approach

The stack is a TypeScript/Node API surface with a Python SDK, deployed via Docker Compose with Supabase as the backing store. We chose it because it solves LLM-ready extraction (markdown plus structured data) out of the box and is self-hostable, which gives full control over data and cost compared with a hosted API.

Outcome / Impact

The work delivered a production-capable, self-hosted crawl-and-extract layer for AI/data pipelines, demonstrating the studio’s ability to evaluate, deploy and integrate substantial open-source infrastructure rather than reinventing it.

Capabilities Demonstrated

Integrating and self-hosting OSS crawl-for-AI infrastructure
Producing LLM-ready / RAG-ready content from the open web
Operating Docker Compose + Supabase service stacks
Selecting and extending best-in-class open source over bespoke builds
