AI, creativity and data curation

While I am not yet fully clear on all the opportunities that AI will reveal in the next decade, one point already seems obvious to me: originality and data curation will become even more critical to help us navigate the jungle of new content created by such bots.

But first, let's set the stage. Seven years from now, AI-generated content accounts for 75% of the whole internet corpus. Only one percent of it is seeded from human prompts. AI did not magically turn everyone into Stephen King, and the easiest way to get new content was to feed the outputs of one model back in as input for another. While this worked for a period, model improvements started to cap, due to the ever-increasing redundancy of the content used as training data.

The ever-increasing scarcity of originality will make it precious, both to differentiate one piece of content from another and to keep models evolving. Since Microsoft's investment in OpenAI, I have heard numerous times that MSFT is now back in the game and will challenge Google. That seems reasonable, but let me substitute another thought: tomorrow's search engines will mainly be used programmatically, by content harvesters, for the sake of training AI models. Humans will happily move to some semi-conversational way of searching through this gigantic new online collection.

Coming back to my original point: what benefit would a search engine provide to such crawlers? Data curation and cataloging, indeed! The challenges will remain the same as the ones Google faces nowadays: to index content in a way that is easily searchable. But the difference here is that the business model will not rely on advertising, the way I see it now at least. The gold nugget will reside in the originality and classification quality of the content. I see two use cases from a business perspective:

1/ Training data filtering and model fine-tuning: rich metadata allows one to find great content to tune the training corpus to the final consumer persona. Take, for instance, an e-commerce platform that needs to present a personalized content experience to each of its identified audiences. What better way to create bonds with a stranger than to speak the same dialect? To really fine-tune the tone, that platform could decide to train the model on the content sources consumed by members of that audience, be it blogs, newspapers, etc. Individuals or companies would opt in to get their content indexed and would manually advertise, through some metadata, the typology of content that they produce for others to consume, either freely or for a premium/subscription.

2/ Data governance: with the increasing mass of content available, models will inevitably end up ingesting texts that do not align with an individual's or a company's values or messages. This would become an issue in case of public incidents caused by such mistraining. Clear data classification will help those entities debug and identify which materials or sources may have caused the training issue.
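To make the first use case a bit more concrete, here is a minimal sketch of what such metadata-driven corpus selection could look like. The schema is entirely hypothetical (the field names, the `human_authored` opt-in flag, and the `select_training_corpus` helper are all my invention for illustration), but the idea is the one above: producers declare the typology of their content, and a platform filters the catalog down to a persona-matched fine-tuning corpus.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CatalogEntry:
    """One opt-in entry in a hypothetical curated content catalog."""
    url: str
    author: str
    topics: frozenset       # producer-declared typology, e.g. {"gardening"}
    dialect: str            # declared tone/locale, e.g. "casual-us"
    human_authored: bool    # provenance flag advertised by the producer
    license: str            # "free" or "premium"

def select_training_corpus(catalog, audience_topics, dialect, require_human=True):
    """Keep entries whose declared metadata matches a target audience persona."""
    return [
        entry for entry in catalog
        if entry.dialect == dialect
        and entry.topics & audience_topics          # topic overlap
        and (entry.human_authored or not require_human)
    ]

# Tiny illustrative catalog (all URLs and authors are made up).
catalog = [
    CatalogEntry("blog.example/post1", "alice", frozenset({"gardening"}), "casual-us", True, "free"),
    CatalogEntry("news.example/a2", "bot-7", frozenset({"gardening"}), "casual-us", False, "free"),
    CatalogEntry("mag.example/x", "bob", frozenset({"cooking"}), "formal-uk", True, "premium"),
]

picked = select_training_corpus(catalog, {"gardening"}, "casual-us")
print([entry.url for entry in picked])  # -> ['blog.example/post1']
```

The same classification fields would serve the governance use case: when a mistrained model misbehaves, the platform can query the catalog for exactly which opted-in sources fed a given fine-tuning run.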

This could be a golden age for artists and creators. But my gut tells me that, should that scenario happen, it will actually lead to an even more fragmented internet, where rich, unique and creative content gets siloed behind walled gardens. Just me being pessimistic, I believe, sorry.

Anyway, it's easy to make predictions, and even easier to get them wrong. As always, only time will tell.