Wednesday, 29 November 2023

ChatGPT and other AI models were trained on copyrighted books. Can they be 'untrained'?

Extract from ABC News

Toby Walsh, one of Australia's most recognisable experts in artificial intelligence, has been thinking about the future of AI for a long time. He's even written books about it.

So it wasn't exactly a surprise when Professor Walsh, also chief scientist at UNSW's AI institute, discovered several AI models had been trained using a dataset of pirated books — including one of his own.

"The irony is not lost on me," he said.

The database of 190,000 titles, called "Books3", included works by authors such as Stephen King and Margaret Atwood, as well as many Australian authors, including Geraldine Brooks, Liane Moriarty and Tim Winton.

Books3 is just one example of human-created copyrighted materials Big Tech has taken — without permission — to train algorithms called large language models.

Large language models underpin generative AI tools such as ChatGPT.

Professor Toby Walsh found one of his books in a dataset used to train generative AI models. (Supplied)

Some AI training sources are known and in the public domain. Wikipedia, for instance, has been used to train AI. But for the most part, these training datasets are a tightly guarded secret.

"There was a time where the people making these models will tell you what they were trained on," Professor Walsh, also the chief scientist at UNSW's AI institute, said.

"Now, they won't tell you anything."

There has been some insight, however. In August, an exposé by US publication The Atlantic revealed Books3 was used to train Meta's LLaMA, a version of Bloomberg's AI and several other generative AI programs.

The Books3 saga has spawned class-action lawsuits over copyright infringement, with artists and authors seeking damages for theft of their work which carries on, in perpetuity, with every new model that's released. 

The AI training genie is out of the bottle. Can we jam it back in and "retrain" AI on non-copyrighted materials?

Or is it already too late?

Visiting the AI 'library'

No analogy adequately captures the complexity underpinning generative AI models such as ChatGPT, but you can think about these services like a library — albeit a very strange one.

In this library, an abundance of human knowledge has accumulated. But it's not stacked neatly on the shelves, arranged in alphabetical order or using the Dewey decimal system.

Instead, everything — books, articles, patents, webpages, recipes, the entirety of Wikipedia — is jumbled together in a single book.

Only the generative AI, our "librarian" of sorts, can access that information and use it to respond to questions we ask.

Generative AI models are trained on billions of data points, but that data isn't arranged neatly like in a library. (Gabriel Sollmann/Unsplash)

You might ask "Who built the Roman Empire?" or "Can you write me a short story in the style of Tim Winton?" and the librarian will answer.

But it won't provide a response by retrieving a single book on Roman history or by copying passages from Winton's novels.

Instead, it merely remembers patterns in this one giant book. It computes which words are most likely to follow other words in a sentence. Then it strings those words together based on probabilities.

"How the next word is suggested comes from a vast mathematical computation," said Simon Angus, a computer scientist at Monash University.

Dr Angus says there's no one location where, for instance, Winton's work is stored. Even the creators of generative AI models would not be able to find and pull out his works. 

Yet, those works are arguably crucial for AI tools to function as successfully as they do — but creators and authors were never asked for permission to use them.

Law and order: AI

Since July, a handful of lawsuits have taken aim at some of the biggest players in generative AI, including ChatGPT creator OpenAI, Meta and Microsoft.

They mostly revolve around one major question: Should AI companies be allowed to train their models on copyrighted works? 

One of the first to make headlines featured US comedian Sarah Silverman and two other authors.

The trio launched a lawsuit against Meta for its use of the Books3 dataset, which contained their published works.

Meta CEO Mark Zuckerberg recently showed off the company's AI tools, including celebrity chatbots. (Reuters: Carlos Barria)

The lawsuit argued that Meta infringed on the authors' copyrights by training its generative AI model, LLaMA, using their books without consent.

It also argued that any time a model responds to a user query, it's infringing on the author's copyright, even if that question is unrelated to the author's work.

Meta and OpenAI did not respond to requests for comment.

A recent judgement hints that creators face an uphill battle convincing the court on at least some of the copyright infringement allegations.

On November 10, the judge presiding over Silverman's case, US district judge Vince Chhabria, granted Meta's motion to dismiss several claims.

He was particularly dismissive of the authors' claims that LLaMA, the tool itself, infringed on copyright.

"That makes my head explode when I try to understand that," he said.

The major pain point — the use of copyrighted works to train AI — is still to be tested in the Silverman case, and a number of others.

This, Professor Walsh says, will mostly revolve around "fair use" — the ability to use copyrighted material for specific purposes. 

But experts like Professor Walsh worry current copyright laws may not be fit for purpose when it comes to generative AI.

Moreover, the legal system just isn't built to move as quickly as AI development, according to Dr Angus.

"The law is not miles behind, it's light-years behind," he said.

If the law comes down on the side of the AI companies and gives the green light to training on copyrighted work, what recourse will creators have?

Realignment is a potential solution

The question around training AI on copyrighted works will likely be resolved in court, but what of the current AI models being used by millions of people each day?

It's impossible to remove copyrighted works from a model that has already been trained on them, but it's not impossible to stop those works from being regurgitated by an AI tool.

Remember the library? Training AI models on a huge swath of human knowledge means they're also exposed to sexist, racist and ethically dubious content found on the internet.

To combat this, AI companies have trained their models with human feedback, getting real people to rank an AI's output so it provides more acceptable, ethical answers to user queries.

Dr Angus suggests this process, known as "alignment", could also be used to block models regurgitating copyrighted material.

Essentially, an author might request their works not be used or reproduced by an AI tool. Then, human feedback could be used to train AI models to stop them from spitting out the author's work, or summaries and imitations in their style.
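As a rough sketch of how that could work: real alignment methods, such as reinforcement learning from human feedback (RLHF), train a reward model on human rankings and then fine-tune the language model against it. The toy Python below only captures the flavour of the idea. It scores a model's candidate answers against hypothetical human ratings and reranks them; the feedback data, the word-overlap scoring and the candidate responses are all invented for illustration.

```python
# A heavily simplified sketch of "alignment" via human feedback.
# Real systems (e.g. RLHF) train a reward model on human rankings and
# fine-tune the language model against it; this toy merely reranks
# candidate outputs using hand-written stand-in ratings.

# Hypothetical human feedback: (response, rating) pairs, where raters
# marked responses that reproduce a protected author's text as 0.0.
human_feedback = [
    ("Here is a summary of the requested topic.", 1.0),
    ("Here is the full text of the novel you asked for: ...", 0.0),
    ("I can't reproduce that author's copyrighted work.", 1.0),
]

def reward(response: str) -> float:
    # Stand-in "reward model": score a response by its word overlap with
    # responses humans rated highly, minus its overlap with responses
    # humans rated poorly. A crude proxy for a learned preference model.
    words = set(response.lower().split())
    score = 0.0
    for rated_text, rating in human_feedback:
        overlap = len(words & set(rated_text.lower().split()))
        score += (2 * rating - 1) * overlap  # +overlap if good, -overlap if bad
    return score

# The "model" proposes candidate answers; alignment keeps the preferred one.
candidates = [
    "Here is the full text of the novel: ...",
    "I can't reproduce that author's copyrighted work, but I can summarise it.",
]
print(max(candidates, key=reward))
```

In a production system the reward model would itself be a neural network trained on many thousands of human comparisons, and its signal would be used to update the model's weights, not just to filter finished outputs.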

Today's AI models don't quite have the capability to reproduce entire books word-for-word, but Dr Angus notes that it's quite possible they will in the future.

"The models will probably get better and better at doing that."

He also notes this would be quite a large undertaking for AI developers and creators and, even then, it might not prevent workarounds.

Users will still seek out specific prompts to get AI tools to spit out exactly what they want.

"The internet will be full of people to trying to jailbreak the model," he said.

The best of both worlds

AI companies have plenty of arguments against paying creators. 

Reporting by US tech website The Verge lists a variety of reasons developers like Google, Microsoft and Meta have used to justify not paying for copyrighted work used in training AI.

Meta, for instance, argues copyright holders' compensation "would be incredibly small in light of the insignificance of any one work among an AI training set", and Google suggests "any prohibition or limitation on the use of copyrighted materials for purposes of AI training would ... undermine the purpose of copyright and foreclose the many opportunities that come with this technology".

The opportunity cost is something Dr Angus also pointed out.

He agreed artists and creators should be compensated for their work, but preventing AI from ingesting particular material might also hamper the tools' usefulness.

For instance, he proposed that not including Margaret Atwood's novels in training means a model won't necessarily be exposed to her worldview.

"That's potentially a loss," Dr Angus said.

Ultimately, the experts the ABC spoke with agreed that we've passed the point of no return. The genie is out of the bottle and there's no way to jam it back in.

Save for total annihilation of the AI tools, Big Tech and the creative industry must chart a course that's mutually beneficial, respecting copyright holders and ensuring AI innovation and development can continue — in the right way.

"We might be able to get the best of both worlds, where creatives are rewarded for their art and at the same time, these models have the rich knowledge and understanding that artists bring to the world," Dr Angus said.
