It's time to brush up robots.txt

What a funny time to live in. (open)AI is eating the tech world, at least in the news, but the same old story repeats: just as in Google's early days, content gets scraped and used for lucrative purposes without much consent being asked of rights holders beforehand.

"It's public content". Sure it is, oh dear, but there is a neat difference with the days when Google indexed your content: avid Google searchers would still get redirected to your site to get full access to the answer or viewpoint you were bringing to their question. With the rise of conversational search, assuming it does indeed become the new norm, that potential traffic will never have the chance to reach your origin anymore. Publishers used to cry at Google when it launched featured snippets, but ChatGPT will be even nastier, answering people's questions without attributing the sources used as part of its training.

Should we care? In my humble opinion, yes, and some other mighty folks do not seem happy with this situation either. The opacity around the sources of the training datasets used to build large language models brings so many downsides that it feels unlikely LLMs will be able to advance past their current state without first attributing their training corpus sources transparently.

Besides the data transparency issue, there are also more practical ones: public data scraped from the web comes in all shapes and formats: txt, html, pdf, jpeg, ... Data cleaning is a massive yet tedious task to carry out before being able to play. "Good thing, let's not make it easy for scrapers to steal data": fair point, even though scraping is still a thing as of today, despite the numerous efforts and budgets deployed by companies over the years to protect their digital assets from such collection. It seems like a problem worth solving: why couldn't we give bots aimed at training AI a way to go to another service, through a documented API, that would give them access to the content they are looking for, enriched with metadata and in a clean, machine-digestible format?
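To give a sense of that cleaning tedium, here is a minimal sketch (class and function names are mine, not an established tool) that strips an HTML page down to bare text using only Python's standard library. Real pipelines additionally have to handle PDFs, images, broken encodings and page boilerplate for every site they touch:

```python
# Minimal sketch: reduce an HTML page to plain text with the stdlib only.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style> elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

page = ("<html><head><style>p{color:red}</style></head>"
        "<body><h1>Title</h1><p>Hello, <b>world</b>.</p></body></html>")
print(html_to_text(page))  # Title Hello, world .
```

Every scraper ends up rewriting some variant of this, which is exactly the duplicated effort a clean, documented API could remove.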

Where does robots.txt fit in that picture though? To me it has always seemed like a good idea initially: you have bot traffic, some of which you care about (Google's crawler), and you want to give the good bots indications on how to consume content. Simple, effective. The missing part though is that this file is not authoritative. If a bot couldn't care less and still wants to scrape everything, it will, without much consequence. You then have to deploy some defensive mechanism through bot management solutions to protect your content. But in the case of AI, I think we may have another way to look at robots.txt. Actually, I would propose to rename it scrape.txt. It would live at the root of a site and list explorable endpoints, just as robots.txt does. Those endpoints would point to a dedicated API that would allow the bot to specify the preferred format for the content, check the content license and pay applicable fees for downloading the cleaned data. Easier for bot builders and AI trainers, and fair to content creators, who would receive a share of the cake.
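As a purely imaginary sketch, borrowing the familiar key: value syntax of robots.txt, such a file could look something like this (every field name and URL here is invented for illustration):

```
# scrape.txt (hypothetical), served at https://example.com/scrape.txt
User-agent: *
Api-endpoint: https://example.com/scrape-api/v1
Formats: text/plain, application/json
License: https://example.com/content-license
Pricing: https://example.com/scrape-api/v1/pricing
```

A bot would fetch this file first, then negotiate format, license terms and payment through the declared API endpoint instead of crawling the raw HTML.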

It's all imaginary for now, but it's the path I am willing to explore in the immediate future. Let me know what you think!