Explained: Model Drift 


Machine Learning models are used to understand and predict trends across industries. Commonly, they provide insights on supply chains, customer propensity, interest rate arbitrage and more. However, like millions of other analytical models, they implicitly assume that the relationships they learned remain consistent over time. 


Gradualism describes evolution that happens slowly; over a short period of time, the change is hard to notice. In punctuated equilibrium, change comes in sudden spurts. Most changes in customer behaviour are akin to gradualism. The pandemic, however, threw most changes into a state of punctuated equilibrium.

For example, banks working with legacy systems and retailers that operated out of brick-and-mortar stores had to find a way to serve customers through digital channels overnight or risk losing out. 

Model drift, also known as model decay, refers to the degradation of a model’s prediction power due to changes in the environment, leading to a change in the relationship between variables. 

This comes as no surprise to data scientists: ML models are unique software entities, and they are therefore susceptible to fluctuation over time. Such models need to be maintained periodically post-deployment to revalidate their business value on an ongoing basis. Usually, though, the changes are subtle and nuanced, unlike the overhaul the pandemic caused. 

Also Read: Datatechvibe Explains: Federated Learning

There are three main types of model drift:

  • Concept drift, where the statistical properties of the dependent (target) variable change. 
  • Data drift, where the properties of the independent variables change, for example through seasonality or alterations in consumer preferences. 
  • Upstream data changes, which refer to operational changes in the data pipeline. An example is when a feature is no longer being generated, resulting in missing values.
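A simple way to flag data drift of the kinds above is to compare the distribution of a feature in production against its training-time baseline. The sketch below is a minimal, pure-Python illustration; the feature (weekly in-store visits), the sample values and the z-score threshold are all illustrative assumptions, not a production recipe:

```python
import statistics

def detect_feature_drift(baseline, live, z_threshold=3.0):
    """Flag drift when the live mean of a feature moves more than
    z_threshold baseline standard deviations from the baseline mean."""
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline)
    live_mean = statistics.mean(live)
    z = abs(live_mean - base_mean) / base_std
    return z > z_threshold, z

# Training-time distribution of, say, weekly in-store visits (illustrative)
baseline = [10, 12, 11, 9, 10, 13, 11, 12, 10, 11]
# Post-pandemic production data: visits collapse toward zero
live = [2, 1, 3, 2, 1, 2, 3, 1, 2, 2]

drifted, score = detect_feature_drift(baseline, live)
print(drifted)  # True: the live mean sits far outside the baseline range
```

In practice, teams tend to use proper two-sample tests (such as Kolmogorov–Smirnov) rather than a raw mean shift, but the monitoring idea is the same: compare production data against the training baseline and alert when they diverge.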

To detect drift, ML teams use model performance metrics like AUC (Area Under the ROC Curve), precision and recall to monitor the real-time performance of production models. However, these metrics need 'ground truth', that is, labels for the real-time predictions. The pandemic-era shift in customer preferences, with food and groceries delivered at home instead of trips to the restaurant or grocery store, is an example of the kind of change these metrics are meant to catch.

It’s not the pandemic alone that is to blame. Consider an ML model built to track spam emails. Most likely, it was trained on a generalised template of the spam emails proliferating at the time. With this baseline in place, the model can identify and stop these sorts of emails, thus preventing potential phishing attacks. However, as the threat landscape changes and cybercriminals become smarter, more sophisticated and realistic emails replace the old ones. When faced with these newer hacking attempts, ML detection systems operating on variables from prior years will be ill-equipped to classify the new threats properly. This, too, is an example of model decay, since the model’s predictive ability degrades. 

Also Read: Cyber Attacks That Shook The World

Detecting and dealing with model drift

The F1 score combines a model’s precision and recall into a single measure of its accuracy, and is a standard statistic for evaluating binary classification. Here is how it works: precision is the number of true positive results divided by the number of all positive predictions, including those that were incorrect. Recall is the number of true positive results divided by the number of all samples that should have been identified as positive. The F1 score is the harmonic mean of the two. 
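These definitions can be made concrete in a few lines. The sketch below is a minimal pure-Python version (libraries like scikit-learn provide equivalent functions); the labels and predictions are illustrative:

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Ground-truth labels vs. a model's predictions (illustrative)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
p, r, f1 = precision_recall_f1(y_true, y_pred)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.75 0.75 0.75
```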

Whenever a specified metric drops below a set threshold, there’s a high chance that model drift is degrading the results. So whether the model is used for medical predictions of populations or business outcomes like profit percentages, its output will no longer be trustworthy. 
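A monitoring job acting on such a threshold might look like the following sketch; the weekly F1 history and the 0.80 threshold are assumptions for illustration:

```python
def check_for_decay(metric_history, threshold=0.80):
    """Return (window, score) alerts for every monitoring window where
    the tracked metric (e.g. F1 or AUC) falls below the threshold."""
    return [
        (window, score)
        for window, score in enumerate(metric_history)
        if score < threshold
    ]

# Weekly F1 scores of a deployed model (illustrative): a steady
# decline after week 3 suggests drift rather than random noise.
weekly_f1 = [0.91, 0.90, 0.89, 0.84, 0.78, 0.71]
alerts = check_for_decay(weekly_f1)
print(alerts)  # [(4, 0.78), (5, 0.71)]
```

A real pipeline would page the owning team or trigger retraining on an alert, but the core logic is this comparison of a tracked metric against an agreed threshold.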

In several cases, measuring the accuracy of a model isn’t possible because it is difficult to compare its predictions against actual data. This is a major challenge when it comes to scaling ML models. Here, regularly refitting models based on past experience can help to create a predictive timeline for when drift might occur in a model. Solutions like IBM’s Watson and Databricks’ ML offerings have features in place to track such decay in the models built for clients, issuing alerts when it’s time to redevelop or update a model. 

One of the problems developers often face in production environments is that the deployed model doesn’t behave the same way as it did during the training phase. First off, it’s important to establish that this isn’t due to a bias in sampling data. Amazon’s SageMaker, a fully managed service that helps developers deploy ML models at scale, comes with a monitoring feature that triggers actions if any drift in model performance is observed.