
What happens when AI has read everything?

Are you a writer scared that you might soon be without a job? Fear not! We need more books for AI to work properly!

Are you a writer scared that you might soon be without a job? Fear not: according to an article in The Atlantic, we may need more books now than ever. According to a research team led by Pablo Villalobos, we may run out of high-quality language data as soon as 2023, and somewhere between 2030 and 2070 we’re likely to run out of vision data as well.

LLMs perform better when trained on books

Now, whereas we’re pretty much generating labeled visual data every day on our social media feeds, we’re considerably slower to replenish our written language data. Because writing takes time. And LLMs are picky readers: large language models trained on books are much better writers than those trained on huge batches of social-media posts.

The Atlantic article then goes on to explain that we might soon run into a shortage of these high-quality data sources, especially books. They mention Google researchers estimating that of the more than 125 million books published since Gutenberg brought printing to Western Europe, between 10 and 30 million have already been digitized, and may therefore already be in AI’s training data. But that’s just a fraction of what future LLMs will be able to ingest.

Let’s put dongles around our neck and record speech acts

The speculative solutions that are mentioned in the Atlantic article sound rather ominous, to be honest:

  • AI could create synthetic training data itself, where an LLM could be “like the proverbial monkeys with typewriters, only smarter and possessed of functionally infinite energy. They could pump out billions of new novels, each of Tolstoyan length”.
  • We humans could provide data to the AI: “we could all wear dongles around our necks that record our every speech act”, or we could harvest our text messages or record the keystrokes of all white-collar workers.

But why?

Even though the researchers remark that these solutions are currently neither feasible nor acceptable, I keep wondering.

Suppose that AI is able to create billions of novels, what’s the point if they’ll never be read by humans?

Sure, I get it, from a machine learning point of view, it may sound logical that if LLMs perform best on books, we should give them books. But from a humanistic point of view, I really wonder what it means to basically have no human in the loop in adding to what’s possibly the most important data set out there: our collective human knowledge.

What would it mean to generate billions of novels and feed them to an AI, without any human having seen, read, or evaluated those texts? Especially when we don’t know yet how AI will be used in the future? Right now, ChatGPT is clearly recognisable as an app. And you can choose to use it or not. But what if these AIs get more integrated in our lives? More ambient? Would you like to base your information consumption, your decision making, and the disclosure of your personal data on a background algorithm that’s basically feeding home-grown fiction back into itself?

And what does it say about us humans, when we’re willing to let algorithms produce the very thing that writers struggle to earn a living with, while fewer and fewer people read for their own pleasure, benefit or learning? Books as cheap, replaceable mass fodder for machines. Is this really something we should aspire to?

That is, if it’s even possible at all, as my 15-year-old nephew remarked yesterday when we were discussing this topic. ‘If AI can’t generate anything new, how can it write something original? Or invent a new literary genre? Create its own style?’

Why are books so good for training LLMs in the first place?

That raises the question: what exactly is it in books that makes LLMs do better? Is it the larger volume of text? A larger and more varied vocabulary? The higher chance of cohesion and coherence, rhetorical devices and stylistic choices that are inherent in creating longer texts? Undoubtedly these text-related aspects play a role.

But I wonder whether it might also be something more fundamentally human. Writing a book takes effort, it takes time, and it takes someone who basically does the thinking for you so you don’t have to. A well-written text effortlessly guides you through a topic, an instruction, or through the writer’s mind, heart and soul.

That means that the writer needs to know things on a deeper level than her reader. For that, she needs to know who she’s writing for. She needs to find the common ground where both writer and reader can meet, not only in terms of knowledge, but also in an emotional and often spiritual sense. In many cases, that’s a result of lived experience. That, to me, is the beating heart of the writing process. Transferring, sharing and rejoicing in lived experience.

Might it be that book-trained LLMs perform better because books were written by people who feel and think?

Nick Cave: a grotesque mockery of what is to be human

Two weeks ago, Nick Cave made the news with his blog on a ChatGPT song “written in the style of Nick Cave”, calling it “a grotesque mockery of what it is to be human”.

So now what?

It seems hard to stop this movement. Then again, we’ve had AI winters before. But imagine…just an idea…if LLMs benefit from books, if ML needs us to be language savvy, if we feel this is important, can’t we make this a shared opportunity for both ML and humanity and increase the creation of human-written books? What if tech were to reinvent itself as the Maecenas of literature? With funds for authors of all genres, initiatives to get people reading again. And rankings of which books should make it into the ML canon for algorithms?

I guess I’m hopelessly old-fashioned and idealistic, but one can dream.
