Wednesday 3 July 2024

The world's biggest AI models were trained using images of Australian kids, and their families had no idea.

Extract from ABC News 

In short:

Images of Australian children were found in a dataset called LAION-5B, which is used to train AI.

The images have since been removed from the dataset, but AI models are unable to forget the material they're trained on, so it's still possible for them to reproduce elements of those images, including faces, in their outputs.

What's next?

The federal government is expected to unveil proposed changes to the Privacy Act next month, including specific protections for children online.

The privacy of Australian children is being violated on a large scale by the artificial intelligence (AI) industry, with personal images, names, locations and ages being used to train some of the world's leading AI models.

Researchers from Human Rights Watch (HRW) discovered the images in a prominent dataset, including a newborn baby still connected to their mother by an umbilical cord, preschoolers playing musical instruments, and girls in swimsuits at a school sports carnival.

"Ordinary moments of childhood were captured and scraped and put into this data set," said Hye Jung Han, a children's rights and technology researcher at HRW.

"It's really quite scary and astonishing."

The images were found in LAION-5B, a free online dataset of 5.85 billion image links and captions, used to train a number of publicly available AI generators that produce hyper-realistic images.

Researchers were investigating the AI supply chain following an incident at Bacchus Marsh Grammar School, where deepfake nude images of female students were allegedly produced by a peer using AI.

HRW examined a sample of 5,850 images from the collection, covering a broad range of subject matter — from potatoes to planets to people — and found 190 Australian children, from every state and territory.

"From the sample that I looked at, children seem to be over-represented in this dataset, which is indeed quite strange," Ms Hye Jung said.

Human Rights Watch sampled 5,850 images from the controversial LAION-5B dataset.

"That might give us a clue [as] to how these AI models are able to then produce extremely realistic images of children."

The images were gathered using a common automated tool called a "web crawler", which is programmed to scour the internet for certain content.
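As an illustration of the mechanism (and not of LAION's actual pipeline, which drew on Common Crawl web data), a crawler-style collector can be sketched in a few lines of Python. The page URL below is a stand-in; the parser simply pairs each image on a page with its alt text, which is one way captions containing names and locations end up stored alongside photos.

```python
# A minimal sketch of collecting image-caption pairs with a web crawler.
# Illustration only: this is not LAION's pipeline, and the URL below
# is a stand-in for a public web page.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class ImageCollector(HTMLParser):
    """Collects (image URL, alt-text) pairs from a page's <img> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            if attrs.get("src"):
                # The alt text becomes the "caption" stored with the image,
                # which is one way names and locations travel with a photo.
                self.pairs.append(
                    (urljoin(self.base_url, attrs["src"]), attrs.get("alt", ""))
                )


url = "https://example.com/"  # stand-in for a photo-gallery page
with urlopen(url) as response:
    collector = ImageCollector(url)
    collector.feed(response.read().decode("utf-8", errors="replace"))

for image_url, caption in collector.pairs:
    print(image_url, "|", caption)
```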

HRW believes the images have been taken from popular photo and video-sharing sites including YouTube, Flickr, and blogging platforms, as well as sites many would presume were private.

"Other photos were uploaded [to their own websites] by schools, or by photographers hired by families," said Hye Jung Han, adding that the images were not easily findable via search, or on public versions of the websites they came from.

The LAION-5B dataset contains 5.85 billion image links in total. This image was not discovered in the dataset. (ABC Kimberley: Vanessa Mills)

Some images also came with highly specific captions, often including children's full names, where they lived, hospitals they'd attended, and their ages when the photo was taken.

The revelations are a wake-up call for the industry, according to Professor Simon Lucey, Director of the Australian Institute for Machine Learning at the University of Adelaide.

He says AI is in a "wild west" phase.

"If there's a data set out there, people are going to use it," he said.

Professor Simon Lucey is an AI image expert at the University of Adelaide. (Supplied)

'The harm is already done'

According to the experts, AI models are incapable of forgetting their training data.

"The AI model has already learned that child's features and will use it in ways that nobody can really foresee in the future," Ms Hye Jung said.

Additionally, there's a slim but real risk that AI image models will reproduce elements of their training data — for example, a child's face.

"There has been quite a lot of research going into this … and it seems to be that there is some leakage in these models," Professor Lucey said.

There are no known reports of children's images being reproduced inadvertently, but Professor Lucey said the capability was there.

He believes there are certain models which should be switched off completely.

"Where you can't reliably point to where the data has come from, I think that's a really appropriate thing to do," he said.

Professor Lucey said some AI models should be switched off completely. (ABC South East SA: Kate Hill)

He emphasised though that there were plenty of safe and responsible ways to train AI.

"There's so many examples of AI being used for good, whether it's about discovering new medicines [or] things that are going to help with climate change.

"I'd hate to see research in AI stopped altogether," he said.

Images of Australian children deleted from dataset

The dataset LAION-5B has been used to train many of the world's leading AI models, such as Stable Diffusion and Midjourney, which are used by millions of people globally.

It was created by a German not-for-profit organisation called LAION.

In a statement to the ABC, a LAION spokesperson said its datasets "are just a collection of links to images available on [the] public internet".

They said, "the most effective way to increase safety is to remove private children's information from [the] public internet".

In 2023, researchers at Stanford found hundreds of known images of child sexual abuse material (CSAM) in the LAION-5B dataset.

LAION took its dataset offline and sought to remove the material, before making the collection publicly available again.

Images of Australian children from every state and territory were found in the dataset. This image was not discovered in the dataset. (ABC News: Tim Swanston)

LAION's spokesperson told the ABC, "it's impossible to make conclusions based on [the] tiny amounts of data analysed by HRW".

The organisation has taken steps to remove the images discovered by HRW, even though they've already been used to train various AI generators.

"We can confirm that we remove all the private children data [sic] reported by HRW."

HRW didn't find any new instances of child sexual abuse imagery in the sample it examined, but said the inclusion of children's images was a risk in its own right.

Hye Jung Han is a researcher and advocate in Human Rights Watch's Children's Rights Division, specialising in children's rights and technology. (Supplied: HRW)

"The AI model is able to combine what it learns from those kinds of [sexualised] images, and… images of real Australian kids," Ms Hye Jung said.

"[It] essentially learns from both of those concepts … to be able to then produce hyper-realistic images of Australian kids, in sexualised poses."

'We need governments to stand up for the community'

While the use of children's data to train AI might be concerning, experts say the legalities are murky.

"There are very, very few instances where a breach of privacy leads to regulatory action," said Professor Edward Santow, a former Human Rights Commissioner and current Director at the Human Technology Institute.

Edward Santow is the Director, Policy and Governance at the Human Technology Institute, and Industry Professor at the University of Technology Sydney. (Supplied: UTS)

It's also "incredibly difficult" for private citizens who might want to take civil action, he said.

"That's one of the many reasons why we need to modernise Australia's Privacy Act," he said.

The federal government is expected to unveil proposed changes to the Act next month, including specific protections for children online.

Professor Santow said it was a long-overdue update for a law that was mostly written "before the internet was created".

"We have a moment now where we need governments to really stand up for the community … because pretty soon in the next year or two, that moment will have passed," he said.

"These [AI] models will all have been created and there'll just be no easy way of unpicking what has gone wrong."

HRW is also calling for urgent law reform.

"These things are not set in stone… it is actually possible to shape the trajectory of this technology now," Ms Hye Jung said.

Schools are increasingly on the frontline of AI-generated child abuse material. (Rhiana Whitson)
