Recent studies show biases and other mistakes in many of the datasets used for training, benchmarking, and testing models, highlighting the danger of putting too much trust in data.
Like gasoline fuels cars, datasets fuel machine learning (ML) and AI models. Tasked with different requirements, such as generating text, recognising objects, or predicting a company’s stock price, AI systems “learn” by going through numerous examples to pick up the underlying patterns in the data.
Computer vision systems, for example, are trained to recognise different types of objects by looking at many images of each category. But datasets are used for more than just developing models. Today, they are also used to test trained AI systems to ensure their stability, and later to measure the field’s overall progress.
But since humans design these AI and ML datasets, they aren’t without their flaws. Recent studies show biases and mistakes in many of the datasets used for training, benchmarking, and testing models, highlighting the danger of putting too much trust in data that hasn’t been thoroughly checked and regulated. Benchmarking, for instance, is one of the most trusted ways to determine a model’s predictive strength.
In AI & ML, benchmarking refers to comparing the performance of multiple models that have been designed for a similar task, like translating words between languages. The practice originated with academics exploring early applications of AI; it has the advantage of organising scientists around shared problems while helping to reveal how much progress has been made.
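In code, the idea behind benchmarking is simple: score every model on the same held-out examples with the same metric, so the resulting numbers are directly comparable. A minimal sketch, with made-up model names and predictions purely for illustration:

```python
# Benchmarking sketch: several models, one shared test set, one metric.
# The models and labels here are invented toy values, not a real benchmark.

def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

gold_labels = [1, 0, 1, 1, 0]          # the shared test set (same for every model)
model_predictions = {
    "model_a": [1, 0, 1, 0, 0],        # 4 of 5 correct
    "model_b": [1, 0, 1, 1, 0],        # 5 of 5 correct
}

# Rank models by their score on the shared metric.
leaderboard = sorted(
    ((accuracy(preds, gold_labels), name) for name, preds in model_predictions.items()),
    reverse=True,
)
for score, name in leaderboard:
    print(f"{name}: {score:.2f}")
```

The point of the shared test set is exactly what makes dataset flaws so consequential: if the gold labels are wrong or unrepresentative, every model on the leaderboard is being measured against a distorted yardstick.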
But a myopic approach to dataset selection carries several risks. If the same training dataset is reused across many tasks, it is improbable that it will accurately reflect the data that models encounter in the real world. When misaligned datasets are used, they can distort the measurement of scientific progress and cause harm to people in the real world.
A recently published study by a group of researchers at the University of California, titled Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research, investigated the problem. It found that there has been “heavy borrowing” of datasets in ML: a community working on one task might borrow a dataset created for another task, raising concerns about misalignment. It also showed that only a dozen universities and corporations are responsible for creating the datasets that are used more than 50 per cent of the time in ML, suggesting that these institutions are shaping the research agendas in the field.
Another study from MIT revealed that several computer vision datasets, including ImageNet, contain “nonsensical” signals that can be problematic. Models trained on these suffer from “overinterpretation”, a phenomenon in which they confidently classify images based on details so sparse that they are meaningless to humans. These signals can lead to model fragility in the real world, yet they are statistically valid patterns within the datasets themselves, meaning overinterpretation cannot be identified using typical evaluation methods.
There are plenty of examples from the past that show the consequences of deploying models trained on flawed datasets. In 2015, a software developer pointed out that the image-recognition algorithms in Google Photos were labelling his black friends as “gorillas”. The non-profit organisation AlgorithmWatch observed that Google’s Cloud Vision API marked thermometers held by a black person as “guns” while labelling thermometers held by a light-skinned person as “electronic devices”. A study by researchers at the University of Maryland found that face-detection services from companies like Amazon, Microsoft, and Google seem more likely to fail with older, darker-skinned individuals and those who are less “feminine-presenting”.
Data Labeling Issues
Labels are the annotations from which many models learn relationships in data, and they bear the hallmarks of the data’s imbalances. Generally, humans annotate the examples in training and benchmark datasets, adding labels like “dogs” to pictures of dogs. But annotators come with their own biases and shortcomings, which translate into imperfect annotations. In MIT’s analysis of popular benchmarks, including ImageNet, the researchers found mislabelled examples: images of one dog breed confused with a similar-looking breed, Amazon product reviews described as negative when they were actually positive, and audio from YouTube videos in which high notes were categorised as whistles.
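One common heuristic for surfacing likely mislabels, sketched below, is to compare each example’s given label against a model’s (ideally cross-validated) predicted probabilities: if the model strongly prefers a different class, the label is worth a second look. This is an illustrative simplification, not the MIT study’s method, and the function name, threshold values, and toy data are all assumptions.

```python
# Heuristic sketch for flagging suspect labels. Thresholds (0.2, 0.8)
# are illustrative choices, not values from any published study.

def flag_suspect_labels(probs, given_labels, low=0.2, high=0.8):
    """Return indices where the model gives the assigned label a
    probability below `low` while assigning some class above `high`."""
    suspects = []
    for i, (p, y) in enumerate(zip(probs, given_labels)):
        if p[y] < low and max(p) > high:
            suspects.append(i)
    return suspects

# Toy example: class probabilities over {0: "husky", 1: "malamute"}.
probs = [
    [0.95, 0.05],   # confidently class 0, labelled 0 -> fine
    [0.05, 0.95],   # confidently class 1, labelled 0 -> suspect
    [0.55, 0.45],   # ambiguous -> not flagged either way
]
labels = [0, 0, 1]
print(flag_suspect_labels(probs, labels))  # -> [1]
```

Flagged examples would still be sent to a human reviewer; the point of the heuristic is to narrow millions of annotations down to a reviewable shortlist.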
One potential solution can be the creation of more inclusive datasets, like MLCommons’ People’s Speech Dataset and the Multilingual Spoken Words Corpus. But curating these datasets is often time-consuming and expensive. Another reason creating a dataset is so costly is the domain expertise required for high-quality annotations. Most low-cost labellers can only annotate “low-context” data and cannot handle “high-context” data such as legal contract classification, medical images, or scientific literature.
Other methods now replace real-world data with partially or entirely synthetic data, although there is as yet no rigorous, conclusive evidence that models trained on synthetic data can match the accuracy of their real-world-data counterparts.
The issues with AI & ML datasets don’t stop at training. The Institute for AI and Decision Support, Vienna, found inconsistent benchmarking across more than 3,800 AI research papers, with many cases attributable to benchmarks that didn’t emphasise informative metrics. A study from Facebook and University College London showed that 60 to 70 per cent of answers given by natural language models tested on “open-domain” benchmarks were hidden somewhere in the training sets, meaning the models had simply memorised the answers.
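The memorisation finding above is, at its core, a train/test contamination problem: a benchmark answer that appears verbatim in the training corpus measures recall, not reasoning. A crude contamination check can be sketched as a normalised substring search; real studies use far more careful matching, and the helper names and sample texts below are invented for illustration.

```python
# Naive contamination check: does a benchmark answer appear verbatim
# (after whitespace/case normalisation) in any training document?
# Real audits use fuzzier matching (n-gram overlap, deduplication, etc.).

def normalise(text):
    """Lowercase and collapse whitespace for a forgiving substring match."""
    return " ".join(text.lower().split())

def answer_in_training_data(answer, training_docs):
    needle = normalise(answer)
    return any(needle in normalise(doc) for doc in training_docs)

training_corpus = [
    "The Eiffel Tower was completed in 1889 in Paris.",
    "Mount Everest is the highest mountain above sea level.",
]
print(answer_in_training_data("completed in 1889", training_corpus))  # True
print(answer_in_training_data("completed in 1901", training_corpus))  # False
```

A model answering the first question correctly tells us nothing about its reasoning if the answer string sits in its training set, which is exactly why the Facebook/UCL result undermines headline benchmark scores.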
Several solutions to the benchmarking problem have now been proposed for specific domains, such as GENIE from the Allen Institute. GENIE incorporates both automatic and manual testing, tasking human evaluators with scoring language models according to predefined, dataset-specific guidelines for fluency, correctness, and conciseness. But GENIE is expensive: it costs around $100 to submit a model for benchmarking. The Allen Institute also plans to explore other payment models, such as requesting payment from tech companies while subsidising the costs of small organisations. There is also a growing consensus within the AI research community that benchmarks, particularly in the natural language domain, must consider broader ethical, technical, and societal challenges if they are to be useful. Some language models have large carbon footprints, but despite widespread recognition of the issue, only a few researchers attempt to estimate or report the environmental cost of their systems.
With such extensive challenges in AI & ML datasets, from imbalanced training data to inadequate benchmarks, attaining meaningful change won’t be easy, but the situation isn’t hopeless.
There is now abundant research showcasing the ethical problems and social harms that can arise from data misuse in ML and AI. These reports have prompted real soul-searching among many data creators, and with labs at several tech giants now focusing more on addressing such failures, we may well see workable solutions in the near future.