Original Reddit post

We’re a research group that collects data from hundreds of websites regularly. Maintaining individual scrapers was killing us: every site redesign broke something, every new site was another script from scratch, every config change meant editing files one by one.

We built ScrapAI to fix this. You describe what you want to scrape, an AI agent analyzes the site, writes extraction rules, tests them on a few pages, and saves a JSON config to a database. After that it’s just Scrapy: no AI at runtime, no per-page LLM calls. The AI cost is per website (~$1-3 with Sonnet 4.5), not per page.

A few things that might be relevant to this sub:

- Cloudflare: We use CloakBrowser (open source, C++-level stealth patches, 0.9 reCAPTCHA v3 score) to solve the challenge once, cache the session cookies, kill the browser, then do everything with normal HTTP requests. The browser pops back up every ~10 minutes to refresh cookies. 1,000 pages on a Cloudflare site take ~8 minutes vs 2+ hours keeping a browser open per request.

- Smart proxy escalation: Starts direct. If you get a 403/429, it retries through a proxy and remembers that domain next time. No per-spider config needed.

- Fleet management: Spiders are database rows, not files. Changing a setting across 200 scrapers is a single SQL query. Health checks test every spider and flag breakage. There’s a queue system for bulk-adding sites.

No vendor lock-in, self-hosted, ~4,000 lines of Python. Apache 2.0.

GitHub: https://github.com/discourselab/scrapai-cli
Docs: https://docs.scrapai.dev/
Also posted on HN: https://news.ycombinator.com/item?id=47233222
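The Cloudflare trick above (solve once in a browser, then scrape over plain HTTP until the cookies age out) can be sketched roughly like this. All names here are illustrative assumptions, not ScrapAI's or CloakBrowser's actual API; `solve_challenge` stands in for whatever launches the stealth browser and returns the session cookies:

```python
import time

# Refresh roughly every 10 minutes, as described in the post.
COOKIE_TTL = 600

class CachedSession:
    """Pay the browser cost once, then reuse the cookies over plain HTTP."""

    def __init__(self, solve_challenge, ttl=COOKIE_TTL):
        # solve_challenge: callable that launches the browser, passes the
        # Cloudflare challenge, and returns a cookie dict (hypothetical hook).
        self.solve_challenge = solve_challenge
        self.ttl = ttl
        self._cookies = None
        self._fetched_at = 0.0

    def cookies(self):
        # Re-solve only when the cached cookies are missing or stale;
        # every other request skips the browser entirely.
        if self._cookies is None or time.time() - self._fetched_at > self.ttl:
            self._cookies = self.solve_challenge()
            self._fetched_at = time.time()
        return self._cookies
```

In use, each plain HTTP request would just pass `session.cookies()` along (e.g. `requests.get(url, cookies=session.cookies())`), so only one browser launch is amortized over hundreds of pages.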
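The proxy escalation described above (start direct, fall back to a proxy on 403/429, and remember the domain) reduces to a small amount of state. This is a hedged sketch, not ScrapAI's code; in Scrapy this logic would live in a downloader middleware, and `get_direct`/`get_via_proxy` are hypothetical callables standing in for the two request paths:

```python
from urllib.parse import urlparse

# Status codes treated as "this domain blocks direct requests".
BLOCK_CODES = {403, 429}

# Domains known to need a proxy; persisted in a real system.
proxied_domains = set()

def fetch(url, get_direct, get_via_proxy):
    # Each callable returns (status_code, body).
    domain = urlparse(url).netloc
    if domain not in proxied_domains:
        status, body = get_direct(url)
        if status not in BLOCK_CODES:
            return body
        # Remember the block so future requests skip the direct attempt.
        proxied_domains.add(domain)
    status, body = get_via_proxy(url)
    return body
```

The point of remembering the domain is that only the first request to a blocking site wastes a round trip; everything after goes straight through the proxy with no per-spider configuration.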
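"Spiders are database rows" means a fleet-wide setting change is one SQL statement rather than 200 file edits. A minimal sketch of that idea, assuming a table with a JSON config column (the schema and column names are my assumptions, not ScrapAI's actual schema), using SQLite's built-in `json_set`:

```python
import json
import sqlite3

# Each spider is a row: domain plus a JSON config blob.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE spiders (domain TEXT PRIMARY KEY, config TEXT)")
for domain in ("a.com", "b.com"):
    db.execute(
        "INSERT INTO spiders VALUES (?, ?)",
        (domain, json.dumps({"download_delay": 1.0})),
    )

# One UPDATE changes the setting across every spider in the fleet.
db.execute(
    "UPDATE spiders SET config = json_set(config, '$.download_delay', 0.5)"
)
```

With configs in a database instead of files, health checks and bulk-adding sites become ordinary queries over the same table.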

Originally posted by u/Routine_Cancel_6597 on r/ClaudeCode