LLM-Ready Web Crawling & Scraping (Integrated/Extended)
Overview
Firecrawl is an open-source service (by mendableai) that turns entire websites into clean, LLM-ready markdown or structured data. We did not author it from scratch: we integrated, self-hosted and extended it as the crawl-and-extract backbone for AI/data pipelines.
Why It Exists
Feeding web content to LLMs and RAG systems requires more than raw HTML: it needs crawling, JavaScript rendering, cleaning and conversion to markdown or structured fields. Firecrawl provides that as a service; we adopted and integrated it rather than rebuilding the capability.
What We Built
We worked with the full Firecrawl monorepo (the apps/ services, docker-compose.yaml for self-hosting, the supabase/ data layer and the documented SELF_HOST.md path) to stand up and integrate a self-managed crawl-for-AI pipeline. Our role was architecting around and extending an OSS platform: deploying it, wiring its crawl, scrape and extract endpoints into downstream workflows, and using its SDK and examples/ as integration references.
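As an illustration of wiring the scrape endpoint into a downstream workflow, here is a minimal Python sketch that uses only the standard library. It is not the project's actual integration code. The base URL http://localhost:3002, the /v1/scrape path, the "formats" field and the response shape {"success": ..., "data": {"markdown": ...}} are assumptions about a typical self-hosted instance and should be checked against the API version you deploy. The function names are hypothetical.

```python
import json
import urllib.request

# Assumption: a self-hosted instance reachable locally; adjust to your deployment.
BASE_URL = "http://localhost:3002"


def build_scrape_request(url, base_url=BASE_URL, api_key=None):
    """Build (without sending) a POST request asking for one page as markdown."""
    # Assumed request body: target URL plus the desired output formats.
    body = json.dumps({"url": url, "formats": ["markdown"]}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed when the instance has authentication enabled.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/v1/scrape", data=body, headers=headers, method="POST"
    )


def extract_markdown(payload):
    """Return the markdown from a decoded response, or raise if the scrape failed."""
    if not payload.get("success"):
        raise RuntimeError(f"scrape failed: {payload.get('error', 'unknown error')}")
    return payload["data"]["markdown"]


def scrape_to_markdown(url, **kwargs):
    """Send the request and return the page as markdown (needs a running instance)."""
    request = build_scrape_request(url, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return extract_markdown(json.load(response))
```

Separating request construction and response parsing from the network call keeps the integration testable without a live crawler. Only scrape_to_markdown needs the service to be running.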
Technologies & Approach
The stack is a TypeScript/Node API surface with a Python SDK, deployed via Docker Compose with Supabase as the backing store. We chose it because it handles LLM-ready extraction (markdown and structured data) out of the box and is self-hostable, which gives full control over data and cost compared with a hosted API.
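To show why markdown output is convenient for RAG, the sketch below splits a markdown document into heading-delimited sections that could then be embedded or indexed. This is a hypothetical downstream step and not part of Firecrawl. The function name split_by_heading and the section dictionary shape are illustrative assumptions.

```python
import re

# ATX-style markdown heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def split_by_heading(markdown):
    """Split markdown into sections, starting a new one at each heading.

    Lines inside fenced code blocks are never treated as headings, so a
    shell comment such as '# install' stays with its code sample.
    """
    sections = []
    heading, lines, in_fence = "", [], False

    def flush():
        # Keep a section only if it has a heading or some non-blank text.
        if heading or any(line.strip() for line in lines):
            sections.append({"heading": heading, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return sections
```

Each returned section keeps its heading alongside its text, so a retrieval index can store the heading as lightweight context for the chunk.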
Outcome / Impact
The result is a production-capable, self-hosted crawl-and-extract layer for AI/data work. It demonstrates the studio’s ability to evaluate, deploy and integrate substantial open-source infrastructure rather than reinventing it.
Capabilities Demonstrated
- Integrating and self-hosting OSS crawl-for-AI infrastructure
- Producing LLM-ready / RAG-ready content from the open web
- Operating Docker Compose + Supabase service stacks
- Selecting and extending best-in-class open source over bespoke builds