Beating bias in AI datasets is more than just data diversity—it's a balance of data and training

A group of researchers from MIT, Harvard University, and Fujitsu asked the question “Can machine-learning models overcome biased datasets?” and put it to the test.

They used an approach from neuroscience to study how training data affects whether an artificial neural network can learn to recognize objects it has not seen before. A neural network is a machine-learning model that loosely mimics the human brain: it contains layers of interconnected nodes, or “neurons,” that process data.
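For readers who want something concrete, here is a minimal sketch of such a network in PyTorch; the layer sizes and the image-classification framing are illustrative assumptions, not details from the study.

```python
import torch
from torch import nn

# A minimal feed-forward network: each Linear layer is a bank of
# interconnected "neurons" that transforms the data it receives
# before passing it to the next layer.
model = nn.Sequential(
    nn.Linear(784, 128),  # input layer, e.g. a flattened 28x28 image
    nn.ReLU(),            # non-linear activation between layers
    nn.Linear(128, 64),   # hidden layer
    nn.ReLU(),
    nn.Linear(64, 10),    # output layer: one score per object class
)

# A forward pass maps a batch of inputs to class scores.
x = torch.randn(32, 784)  # a batch of 32 dummy inputs
scores = model(x)         # shape: (32, 10)
print(scores.shape)
```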

The new results show that diversity in training data has a major influence on whether a neural network can overcome bias, but that greater diversity can also degrade the network’s performance on familiar data. They also show that how a neural network is trained, and the specific types of neurons that emerge during training, play a major role in whether it can overcome a biased dataset.

The crucial discovery was that diversity comes at a cost; the more diverse the dataset, the less effective the network is at recognising things it has already seen:

“But it is not like more data diversity is always better; there is a tension here. When the neural network gets better at recognizing new things it hasn’t seen, then it will become harder for it to recognize things it has already seen.”
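To make that trade-off concrete, here is a hedged sketch of how it could be measured: the same model is scored on a “seen” split (data like its training set) and an “unseen” split (novel data it never encountered). The placeholder model and random data are stand-ins, not the paper’s setup.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def accuracy(model, loader):
    """Fraction of examples the model labels correctly."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total

# Illustrative stand-ins: 'seen' mimics data like the training set,
# 'unseen' mimics novel data the network never encountered.
model = nn.Linear(16, 4)  # placeholder classifier
seen = DataLoader(TensorDataset(torch.randn(100, 16),
                                torch.randint(0, 4, (100,))), batch_size=32)
unseen = DataLoader(TensorDataset(torch.randn(100, 16),
                                  torch.randint(0, 4, (100,))), batch_size=32)

# The tension described above would show up as unseen accuracy
# rising while seen accuracy falls as training data gets more diverse.
print("seen:", accuracy(model, seen), "unseen:", accuracy(model, unseen))
```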

The researchers also found that multitask training can have adverse effects compared to training a separate model for each task:

In machine learning, it is common to train a network to perform multiple tasks at the same time. The idea is that if a relationship exists between the tasks, the network will learn to perform each one better if it learns them together.

But the researchers found the opposite to be true — a model trained separately for each task was able to overcome bias far better than a model trained for both tasks together.
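As a rough sketch of the two setups being contrasted, the code below shows a shared multitask network next to independent single-task networks; the example tasks and layer sizes are assumptions for illustration, not the architecture from the study.

```python
import torch
from torch import nn

# Multitask setup: one shared trunk feeds two task-specific heads,
# so both tasks are forced to share the same learned features.
class MultiTaskNet(nn.Module):
    def __init__(self, in_dim=64, n_task_a=10, n_task_b=8):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU())
        self.head_a = nn.Linear(32, n_task_a)  # e.g. object category
        self.head_b = nn.Linear(32, n_task_b)  # e.g. a second property

    def forward(self, x):
        h = self.trunk(x)
        return self.head_a(h), self.head_b(h)

# Separate setup: one independent network per task, nothing shared.
def make_single_task_net(in_dim=64, n_out=10):
    return nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(),
                         nn.Linear(32, n_out))

# Per the finding above, the separately trained networks overcame
# dataset bias far better than the shared multitask network.
shared = MultiTaskNet()
a_scores, b_scores = shared(torch.randn(5, 64))
solo_a = make_single_task_net(n_out=10)
solo_b = make_single_task_net(n_out=8)
```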

“The results were really striking. In fact, the first time we did this experiment, we thought it was a bug. It took us several weeks to realize it was a real result because it was so unexpected,” one of the researchers says.

It’s an interesting study, and it has ramifications for how we adopt machine learning in projects that must be accurate to avoid deaths, wrongful convictions, and diminished quality of life, for everyone but especially for marginalised groups who are routinely harmed by careless use of AI.
