Bias in the Training Data Leads to a Biased Algorithm

Datatechvibe Interview with Sarah Lowe, Vice President Business Development at Appen

Your model is only as good as the data that feeds it, says Sarah Lowe, Vice President of Business Development (EMEA) at Appen, a technology company focused on high-quality training data for deploying world-class AI with confidence.

Using training data that is not inclusive of all intended users of the model leads to an inadvertent bias, which isn't simply a social blunder. It's an unsuccessful model that will ultimately be a failure for business and the community. This is one of the reasons Appen works with a pool of skilled annotators (over one million worldwide) who operate in over 170 countries and more than 235 languages.

Yolande D'Mello speaks to Sarah about human involvement in machine learning, creating responsible AI, and how the Middle East's banking and retail sectors are leveraging automation to improve the customer experience.

How does training data impact the success of an AI model? What role does Appen play?

The most important aspect of building and deploying a successful AI model is the data: the training data. A model needs to be trained on high-quality labelled data to be useful and successful in the real world. For example, if you are creating a customer-service chatbot, the data may be different ways to ask, 'What is my account balance?' in both text and audio, which is then translated into other languages to serve global requirements.
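The idea of many surface forms mapping to one meaning can be sketched in a few lines of Python. This is an illustrative mock-up, not Appen's data format: each labelled row is a different way of asking the same question, tagged with a single intent label so a chatbot model learns to generalise across phrasings and languages.

```python
# Hypothetical labelled training data for a banking chatbot intent.
# Every utterance below is a different surface form of the same question,
# so all rows share one intent label ("check_balance").
training_data = [
    {"text": "What is my account balance?",      "intent": "check_balance", "lang": "en"},
    {"text": "How much money do I have?",        "intent": "check_balance", "lang": "en"},
    {"text": "Show me my balance, please.",      "intent": "check_balance", "lang": "en"},
    # Translated variants serve global requirements (labels are illustrative):
    {"text": "Quel est le solde de mon compte ?", "intent": "check_balance", "lang": "fr"},
    {"text": "ما هو رصيد حسابي؟",                  "intent": "check_balance", "lang": "ar"},
]

# Many surface forms, one meaning: the model sees varied phrasing per intent.
intents = {row["intent"] for row in training_data}
print(intents)  # {'check_balance'}
```

A real dataset would cover many intents, each with hundreds of phrasings per language; the point is that the label, not the wording, is what the model is trained to recover.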

Appen is the leading provider of high-quality training data. We collect data, label images, text, speech, audio and video, and provide the industry's most advanced AI-assisted data annotation platform to build and continuously improve artificial intelligence systems.

You may have the most appropriate algorithm, but if you train your machine on bad data, it will learn the wrong lessons, fail, and not work as you or your clients expect. Successful AI is almost entirely reliant on the right data. As a data partner, Appen underpins all of these aspects of data supply, data annotation technology and service.


Is computer vision achieving human-like capabilities? What will be its impact? 

Computer vision is becoming one of the fastest-growing fields of AI. Image data is more readily available than ever before, and we foresee an ongoing need for image and video annotation at scale. With a continuous stream of training data, a computer vision system has the potential to achieve some very human-like capabilities, emulating the human ability to see and interpret the world around us, but at speed and in very large volumes. The constant training and re-tuning of data inputs will be the most important part of a CV model's success. These models can be faster and more productive, driving revenue and reducing costs.

Which industries in the Middle East are showing the highest levels of adoption of AI? 

Banking and retail are among the fastest-growing industries focusing on automation to improve the customer experience. Some of the use cases being implemented are customer-service chatbots, IT process and business automation improvements, and improving web search and eCommerce content relevance. Speech and language services are the fastest-growing segments, due to the increased use of Automatic Speech Recognition (ASR) models to automate customer interactions, improve voice assistants and track engagement via sentiment analysis. And of course, self-driving cars are a high-profile area anywhere in the world; the region closely follows the automotive industry and its developments in computer vision applications for autonomous driving.

How serious is Algorithmic bias in AI, and how can it be overcome? 

Any bias in the training data will lead to a biased algorithm, resulting in an unsuccessful model and, ultimately, failure. The training data needs to be inclusive of all intended users of the model, or it will carry an unintentional bias. To overcome bias, the data needs to be balanced, diverse and sourced from a broad group of people. This is why Appen's crowd of skilled annotators numbers over one million worldwide, operating in over 170 countries and more than 235 languages. This diverse resource pool is maintained to ensure our customers can get their training data labelled using just the right profile and range of resources. For example, if your in-house data is skewed towards your current customers, the model will only work for your current customers. The data needs to include all types of potential customers so the model is trained on any users it may interact with.
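One practical first step towards balanced data is simply measuring each group's share of the dataset before training. The sketch below is a minimal, assumed example (the samples and region tags are invented for illustration): it computes group shares and flags under-represented groups against an arbitrary threshold.

```python
from collections import Counter

# Hypothetical mini-dataset tagged with the speaker's region (values invented).
samples = [
    {"text": "what's my balance",  "region": "EMEA"},
    {"text": "check my balance",   "region": "EMEA"},
    {"text": "balance please",     "region": "EMEA"},
    {"text": "saldo da conta",     "region": "LATAM"},
]

def group_shares(rows, key):
    """Return each group's share of the dataset, used to spot skew before training."""
    counts = Counter(row[key] for row in rows)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

shares = group_shares(samples, "region")
# Illustrative cut-off: any group under 30% of the data gets flagged
# for additional, targeted data collection.
underrepresented = [g for g, share in shares.items() if share < 0.3]
print(shares)            # EMEA dominates at 0.75 vs 0.25
print(underrepresented)  # ['LATAM']
```

In practice the "groups" would span whatever dimensions the model's intended users vary on, such as language, accent, age or device, and a flagged group would trigger further sourcing rather than just a warning.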


What is responsible AI, and why is it important? 

Responsible AI can mean something different to each organisation. Our definition includes both the input and the output of the AI model. It means the model outcome is unbiased and equitable: it works for all intended users and ideally improves the quality of life for everyone it touches. It also means the data was sourced ethically and fairly, and that workers were paid a fair wage and treated well.

With the advances in data annotation software, will human involvement be important? 

Human-in-the-loop will continue to be a mainstay of machine learning. It combines human intelligence and judgement with machine intelligence to build machine learning models. Training data does not label or collect itself: machines need to be trained on data collected, labelled or annotated by humans. Machine learning assistance, or smart labelling, capabilities are built into data annotation platforms like Appen's to enhance quality, accuracy and annotation speed. But whilst technology can help with some annotation tasks, human intelligence remains essential for creating high-quality training datasets.
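A common shape for this kind of ML-assisted labelling is confidence-based routing: a model pre-annotates every item, and only low-confidence predictions are sent to human annotators. The sketch below is a hedged illustration of that pattern (the threshold, toy model and function names are assumptions, not Appen's implementation).

```python
# Sketch of human-in-the-loop "smart labelling": the model pre-labels each
# item; confident predictions are auto-accepted, the rest go to humans.
CONFIDENCE_THRESHOLD = 0.9  # illustrative cut-off

def route(items, model_predict):
    """Split items into auto-accepted (item, label) pairs and a human review queue."""
    auto, for_humans = [], []
    for item in items:
        label, confidence = model_predict(item)
        if confidence >= CONFIDENCE_THRESHOLD:
            auto.append((item, label))
        else:
            for_humans.append(item)  # a human annotator labels these
    return auto, for_humans

# Toy stand-in model: "confident" on short texts, unsure on longer ones.
preds = lambda text: ("check_balance", 0.95 if len(text) < 20 else 0.5)
auto, queue = route(["balance please", "a much longer ambiguous request here"], preds)
```

The human-labelled items then feed back into training, so each loop iteration shrinks the share of items the model is unsure about; that feedback is the "loop" in human-in-the-loop.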