Consider this figure: $15 million per year. That’s the research firm Gartner’s estimate of the average financial impact of poor data quality per organisation. Data, one of the most valuable resources available to businesses, is only useful if it’s high-quality.
“Good training data leads to a good predictive model; would it not be nice just to clean up the data set fed in? There is a double data quality whammy when it comes to machine learning,” said Tom Redman, “the Data Doc”, who has helped many Fortune 100 companies improve data quality.
Inevitably, then, data needs refinement, which comes in the form of data labelling. Beyond the quality of the data it collects, an organisation also needs to understand the quality of its labels. Although there is no universal benchmark, it is essential to know whether the data is duplicated or redundant, and the degree to which it portrays reality.
Good data is accurate and up-to-date, relevant, free from redundancies, complete (containing only full records) and consistent, since a steady pattern is essential to keep the database manageable.
Valid data is critical for the strategic and operational steering of every organisation. Having appropriate data quality processes in place directly correlates with an organisation’s ability to make the right decisions at the right time, creating a stronger supply chain, ensuring compliance with industry standards and reducing cost by cutting down on errors. Relationships improve not only with customers but also with trading partners when the business is backed by good-quality data.
Since more and more decisions are based on data, bad data quality is a growing problem, leading to poor decision-making, business inefficiencies and mistrust. Identifying “garbage” data is therefore the first step in unlocking data’s vast potential for an organisation. Because a bad dataset can never produce good models, data cleaning, which prevents the distortion of algorithms, and formatting are the primary steps.
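As a sketch of what those first steps can look like in practice, the snippet below uses pandas to deduplicate records, drop rows missing a key field and normalise formats. The file and column names are hypothetical, chosen only for illustration.

```python
# Minimal cleaning pass over a hypothetical customer dataset.
import pandas as pd

df = pd.read_csv("customers.csv")                  # hypothetical input file

df = df.drop_duplicates()                          # remove redundant records
df = df.dropna(subset=["customer_id"])             # drop rows missing a key field
df["email"] = df["email"].str.strip().str.lower()  # normalise formatting
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

df.to_csv("customers_clean.csv", index=False)
```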
In machine learning (ML), data labelling is the process of identifying raw data (images, text files, videos) and adding meaningful, informative labels that provide context so that an ML model can learn from it. For example, labels might indicate whether a photo contains a bird or a car. Likewise, labelling specific components, such as each interaction with a customer, trains algorithms to recognise the same patterns in similar future datasets.
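At its simplest, a labelled dataset is just a collection of (raw input, label) pairs. A toy illustration in Python, with hypothetical file paths and the bird/car labels from the example above:

```python
# Each example pairs a piece of raw data with a meaningful label.
labelled_data = [
    ("images/0001.jpg", "bird"),
    ("images/0002.jpg", "car"),
    ("images/0003.jpg", "bird"),
]

for path, label in labelled_data:
    print(f"{path} -> {label}")
```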
Efficient data labelling
ML algorithms are designed to spot opportunities for improved automation: labelled data helps a model to anticipate follow-up customer concerns and to suggest relevant products and services. Successful ML models are built on large volumes of high-quality training data. Many AI and data labelling companies practise active learning, in which data is labelled incrementally to achieve the highest accuracy with the least amount of data, as sketched below. The process of creating the training data needed to build these models is often complicated and time-consuming, and most models require a human to label data manually. ML helps to overcome this challenge by using a model to label data automatically.
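First, the active learning idea: a minimal sketch of uncertainty sampling, assuming scikit-learn and synthetic data. In each round, the examples the current model is least confident about are the ones routed to a human annotator (simulated here by the already-known labels y).

```python
# Active learning by uncertainty sampling on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
labelled = np.zeros(len(X), dtype=bool)
labelled[:20] = True                             # small human-labelled seed set

for _ in range(5):
    model = LogisticRegression(max_iter=1000).fit(X[labelled], y[labelled])
    pool = np.where(~labelled)[0]                # still-unlabelled examples
    proba = model.predict_proba(X[pool])
    uncertainty = 1 - proba.max(axis=1)          # low top-class probability = unsure
    query = pool[np.argsort(uncertainty)[-10:]]  # 10 least-confident examples
    labelled[query] = True                       # "annotate" them (labels come from y)

print(f"Labelled {labelled.sum()} of {len(X)} examples")
```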
In this automated-labelling process, an ML model is first trained on a subset of raw data that humans have labelled. The human-generated labels are then fed back to the labelling model so that it can learn and improve its ability to label the next batch of raw data automatically. Over time, the model can label more and more data itself, substantially speeding up the creation of training datasets.
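A sketch of that loop, again with scikit-learn on synthetic data: the model trains on the human-labelled seed, then automatically accepts only the predictions above a confidence threshold (0.95 here, an illustrative choice rather than a recommendation).

```python
# Model-assisted labelling: auto-accept only high-confidence predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, random_state=1)
labels = np.full(len(X), -1)                      # -1 marks "not yet labelled"
labels[:100] = y[:100]                            # small human-labelled seed set

for _ in range(3):
    known = labels != -1
    model = RandomForestClassifier(random_state=1).fit(X[known], labels[known])
    pool = np.where(~known)[0]
    if pool.size == 0:
        break                                     # everything is labelled
    proba = model.predict_proba(X[pool])
    confident = proba.max(axis=1) >= 0.95         # auto-accept threshold
    labels[pool[confident]] = model.classes_[proba[confident].argmax(axis=1)]

print(f"{(labels != -1).sum()} of {len(X)} examples now labelled")
```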
While automation sounds promising, allowing IT teams finally to get rid of tedious work, an ML model still has to be trained by showing it thousands of examples so that it can learn the correlations between pieces of data. The more you feed into it, the more accurate its predictions become.
Increased volume increases the risk of mistakes and low-quality data. An organisation therefore needs to know what data is used, who uses it, how, and for what purposes. Ensuring security within an organisation is crucial, and even more so if it decides to outsource its data labelling. It is essential to check that the third-party firm follows appropriate data governance practices and can keep the datasets secure.
To minimise the mistakes an ML algorithm makes, it is important to analyse the dataset, as the labelling of AI training sets is prone to two kinds of inaccuracy: blind spots, where the model lacks training on certain types of data, and biases, which favour a certain variety of data. Analysing labelled data is especially tricky in semi-supervised ML, where there is a risk of generating pseudo-labelled data from inaccurate tags or classes. After the primary labelling, human experts are crucial in deciding what action is necessary.
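One simple analysis is to check the label distribution for under-represented classes, a common source of blind spots. A sketch with pandas; the file, column name and 5% threshold are hypothetical.

```python
# Surface possible blind spots: classes with very few labelled examples.
import pandas as pd

df = pd.read_csv("labelled_data.csv")    # hypothetical labelled dataset

counts = df["label"].value_counts(normalize=True)
print(counts)

rare = counts[counts < 0.05]             # illustrative 5% threshold
if not rare.empty:
    print("Possible blind spots:", list(rare.index))
```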
A flaw in the training dataset can have disastrous results. For example, Amazon discontinued use of a recruiting algorithm after discovering gender bias.
The data the engineers used to create the algorithm was derived from resumes submitted to Amazon over a 10-year period, which came predominantly from men. The algorithm was taught to recognise word patterns in the resumes rather than relevant skill sets, and these patterns were benchmarked against the company’s predominantly male engineering department to determine an applicant’s fit, resulting in gender bias.
It is critical for an organisation to create a policy that explains how to create training labels correctly and in line with its strategic goals. There should be no room for interpretation.
Identifying and measuring bias is the first and most significant step an organisation can take to reduce ML bias; a minimal check is sketched below. Oversight of data collection sources is also a must, and authentication of the data is imperative. Finally, the sorting and filtering of data should be transparent.
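As one way to make such a measurement concrete, the hypothetical sketch below compares positive-outcome rates across groups, a basic demographic parity check; the data is invented for illustration.

```python
# Compare shortlisting rates across groups (demographic parity gap).
import pandas as pd

df = pd.DataFrame({
    "gender":      ["m", "m", "f", "f", "m", "f", "m", "f"],
    "shortlisted": [1,   1,   0,   1,   1,   0,   0,   0],
})

rates = df.groupby("gender")["shortlisted"].mean()
print(rates)
print("Parity gap:", rates.max() - rates.min())
```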
Data literacy is also important. Data-driven organisations need more people with the ability to interpret data and to draw insights. Data literacy empowers all levels of workers to ask the right questions of data and machines, build knowledge, make decisions, and communicate meaning to others. IDC forecasts a ten-fold increase in worldwide data by 2025. Organisations increasingly need to produce data literate employees who contribute more to their roles and help businesses sharpen their competitive edge.
Companies like Bloomberg and Adobe now run data science and digital academies focused on helping employees learn how to analyse data. It is a responsibility more employers should embrace, and those that do can reap the rewards.
Garbage in, garbage out has bedevilled data experts for decades. Now it demands a new way of thinking: the accuracy of an organisation’s trained model will only ever match the accuracy of its objective standard, so spending time and resources to ensure highly accurate data labelling is essential.
The benefits of improving data quality go far beyond reduced costs. Improving data quality enables an organisation to more easily pursue data strategies, and good data-driven decision-making markedly improves business performance.