Original Reddit post

built a research agent last week that scrapes competitor landing pages and summarizes changes. felt pretty clean honestly. except i didn’t account for one thing: half the sites it was hitting had started serving bot detection pages instead of real content.

my agent didn’t know the difference. it just kept “summarizing” cloudflare challenges and empty divs like they were real content. 6 hours. hundreds of API calls to my LLM. all on garbage HTML. the actual useful data i got back? maybe 12 pages out of 200.

i’m not managing my own scraping infrastructure for AI agents anymore. what are you guys using that actually returns clean content and fails gracefully when it hits a wall? tired of babysitting this stuff.
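the failure mode described above (feeding challenge pages and empty divs to the LLM as if they were content) can be caught cheaply before any API call. here is a minimal sketch of that kind of pre-filter; the marker strings, the length threshold, and the function names are my own assumptions for illustration, not an exhaustive bot-detection check:

```python
import re

# Strings that commonly appear on interstitial/challenge pages.
# Illustrative list only -- real challenge pages vary.
CHALLENGE_MARKERS = (
    "just a moment",             # Cloudflare interstitial title
    "checking your browser",     # classic Cloudflare challenge copy
    "enable javascript and cookies",
)

def looks_like_challenge(html: str, min_text_chars: int = 200) -> bool:
    """Return True if the page is probably not real content."""
    lowered = html.lower()
    if any(marker in lowered for marker in CHALLENGE_MARKERS):
        return True
    # Crudely strip tags; a near-empty visible body is another red flag.
    text = re.sub(r"<[^>]+>", " ", html)
    text = re.sub(r"\s+", " ", text).strip()
    return len(text) < min_text_chars

def pages_worth_summarizing(pages: dict[str, str]) -> dict[str, str]:
    """Keep only pages that pass the filter; skip the rest, don't summarize them."""
    return {url: html for url, html in pages.items()
            if not looks_like_challenge(html)}
```

gating LLM calls on a check like this would have turned "6 hours and hundreds of API calls on garbage HTML" into a log line listing the ~188 pages that got skipped.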

Originally posted by u/LxM420 on r/ArtificialInteligence

  • Treczoks@lemmy.world · 7 hours ago

    Anyone who wastes other people’s resources to scrape the net deserves any bill that comes for it. You are the problem that needs to be fixed ASAP.