Original Reddit post

My hot take: if I want to collect data from a website and I'm writing code to automate it, there are generally some accepted rules of the road. Check the sitemap. Look at robots.txt. Respect rate limits. Follow the website's preferences where possible.

What I find interesting is that most AI agents I've used seem completely indifferent to any of that. They'll happily generate a scraper that makes hundreds of thousands of requests, spins up Playwright sessions, rotates through pages, and generally optimizes for "get me the data" rather than "should I be doing this?"

Given how accessible AI-assisted coding has become, it feels inevitable that ordinary people, not just companies, will start operating their own scrapers. Especially for high-value information that appears deceptively easy to obtain, like Google search rankings, product data, or job-market intelligence. That creates an obvious headache for Google, ATS platforms, and basically every website on the internet if everyone and their mother starts firing up Playwright sessions in Python.

The part I'm struggling with is responsibility. Is this something AI providers should be thinking about? If Anthropic, OpenAI, Cursor/Anysphere, etc. can generate increasingly sophisticated collection tools, do they have any obligation to consider the downstream effects?

At the same time, I don't see an obvious solution. The moment you start adding guardrails, you risk making these tools dramatically less useful for legitimate research, accessibility, automation, and software engineering work.

Maybe this has already been solved and I'm missing something. Curious how people here think about it.

(Ended up writing a longer piece on this after a weird experience involving AI-generated search ranking data and web-integrated LLMs: personal blog)
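
For anyone wondering what the "rules of the road" above look like in code, here is a minimal sketch of the robots.txt and rate-limit checks, using only Python's standard library. The class name `PolitenessPolicy`, the `example-bot` user agent, and the one-second fallback delay are all made up for illustration; nothing here is taken from any particular agent or scraper. It parses robots.txt text that you have already downloaded, so the sketch itself makes no network requests.

```python
from urllib import robotparser


class PolitenessPolicy:
    """Hypothetical helper: decides whether and when a URL may be fetched."""

    def __init__(self, robots_txt, user_agent="example-bot", default_delay=1.0):
        self.user_agent = user_agent
        self.parser = robotparser.RobotFileParser()
        # parse() takes the lines of an already-downloaded robots.txt file.
        self.parser.parse(robots_txt.splitlines())
        # Use the site's Crawl-delay if it declares one, else a fallback
        # (the fallback value is an assumption, not a standard).
        declared = self.parser.crawl_delay(user_agent)
        self.delay = float(declared) if declared is not None else default_delay
        self._last_request = None

    def allowed(self, url):
        """True if robots.txt permits this user agent to fetch the URL."""
        return self.parser.can_fetch(self.user_agent, url)

    def wait_time(self, now):
        """Seconds to wait before the next request, given the current time."""
        if self._last_request is None:
            return 0.0
        return max(0.0, self.delay - (now - self._last_request))

    def record_request(self, now):
        """Call right after making a request so the next one is spaced out."""
        self._last_request = now
```

In a real scraper you would call `allowed(url)` before every request, sleep for `wait_time(time.monotonic())`, make the request, and then call `record_request(time.monotonic())`. Passing the time in explicitly is just a design choice to keep the sketch easy to test.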

Originally posted by u/TacoTuesdayX on r/ArtificialInteligence