Intro to Post-deployment model performance

gema.parreno.piqueras
5 min read · Apr 26, 2022

Once machine learning models are deployed, they face reality and its evolution over time. Performance degradation is a phenomenon that happens after a model is deployed into production and can be defined as the deterioration of its predictive quality over time. It is a complex and exciting challenge because it has a significant impact on the business problem and it is silent.

Performance degradation is usually detected a posteriori, once we can compare the predictions (y_hat) with the true labels (y, or actuals). So, how do we estimate performance in the absence of the targets? Using the open-source library NannyML.

Fig 1. The business challenge example we will be working on. We predict whether a worker will work from home on a given day, in a binary classification scenario. Each dataset row contains a uniquely identified observation about a worker on a day of the week and considers several features, such as the distance from the office, transportation costs, and whether the worker worked from home the previous day.

Why do machine learning models fail?

A machine learning model is generally trained on historical, static data. However, the environment and key conditions change over time, and so do the patterns that the data holds, leading the model's predictions to degrade. There are two main reasons why machine learning models fail:

  • Data Drift: unpredictable changes in the distribution of the input data encoded in the features the model infers from. It might impact the fitted function or how accurate the classifier is. Even though this might be easy to spot in small datasets, it can be challenging when we have hundreds of features in large-scale datasets.
  • Concept Drift: a change in the decision boundary, i.e. in the overall pattern between the inputs and the target. The classifier captures a snapshot of that relationship, but as the concept changes, this snapshot becomes less and less accurate. In other words, the behaviour of the classifier does not change, but the underlying pattern does, and the snapshot is no longer a good representation of reality. [1]
Fig 2. Concept drift and data drift. Factors like seasonality can cause data drift, and one way to detect concept drift is to plot feature importance for the classifier. An example of data drift in the work-from-home prediction challenge is that we recently hired some people who live far from the office, or a later pizza-Fridays initiative that makes workers return.

The main difference between data drift and concept drift is that data drift is a change in the statistical properties of the independent variables (the features), while concept drift is a change in the pattern between the inputs and the outputs captured by the classifier, or a major change in the statistical properties of the dependent variable. NannyML performance estimation focuses on detecting data drift and multivariate data drift.
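To make the data-drift idea concrete, here is a minimal, library-agnostic sketch (not NannyML's own detector): a two-sample Kolmogorov-Smirnov test on a hypothetical distance-from-the-office feature, where the analysis period has drifted away from the reference period.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical "distance from the office" feature: the analysis period
# is shifted farther from the office than the reference period.
reference_distance = rng.normal(loc=5.0, scale=2.0, size=2_000)
analysis_distance = rng.normal(loc=9.0, scale=2.5, size=2_000)

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests the
# feature distribution has drifted between the two periods.
ks_stat, p_value = stats.ks_2samp(reference_distance, analysis_distance)
print(f"KS statistic: {ks_stat:.3f}, p-value: {p_value:.3g}")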

How does it work in NannyML?

In real-world problems, we usually face a time lag between the prediction and the ground-truth label, or we may never receive the actual label at all. So the very high-level idea is to base the estimation on thresholds associated with the predicted probabilities, and to compare data from the analysis and reference partitions to detect data drift.
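To build intuition for the confidence-based idea, here is a deliberately simplified sketch, not NannyML's actual implementation: if the predicted probabilities are well calibrated, the expected accuracy can be estimated from the scores alone, without the true labels.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical calibrated probabilities of the positive class (working from home)
y_pred_proba = rng.uniform(0.0, 1.0, size=1_000)
y_pred = (y_pred_proba >= 0.5).astype(int)

# With calibrated scores, the chance that each prediction is correct is
# y_pred_proba for predicted positives and 1 - y_pred_proba for predicted negatives.
prob_correct = np.where(y_pred == 1, y_pred_proba, 1 - y_pred_proba)

# Expected accuracy, estimated without access to the true labels
print(f"Estimated accuracy: {prob_correct.mean():.3f}")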

The library offers a performance estimation algorithm based on the predicted probabilities under the hood. However, to run performance estimation in NannyML, we should take a few things into account:

  • Partitions: the dataset gets split into reference and analysis partitions. The reference partition sets the expectations and needs the true label (y); it represents a period in which we are sure the model is performing well. If the model hasn't been deployed yet, we can use the test set of the baseline model here. The analysis partition doesn't need the real label: NannyML's confidence-based performance estimation is done on the analysis period.

The reference partition has both the true label and the predicted probability, and we should know that it corresponds to good model performance. The analysis partition only has the predicted probability, not the real target label. Review the data partition documentation.

  • The need for the score / predicted probability of the positive class (the predict_proba or decision_function scikit-learn method, depending on the classifier) as a column inside the dataset, a unique identifier for each observation, and, in this case, a timestamp at which the observation occurred.
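As an illustration, this is roughly what the two partitions could look like; apart from work_home_actual, the column names here (identifier, timestamp, y_pred_proba, partition, and the feature names) are assumptions made for the example.

import pandas as pd

# Minimal sketch of the expected layout; column names other than
# 'work_home_actual' are illustrative assumptions.
reference = pd.DataFrame({
    "identifier": [0, 1, 2],
    "timestamp": pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-03"]),
    "distance_from_office": [3.2, 7.5, 1.1],
    "transportation_cost": [4.0, 9.5, 2.1],
    "work_home_previous_day": [0, 1, 0],
    "y_pred_proba": [0.12, 0.81, 0.07],  # score of the positive class
    "work_home_actual": [0, 1, 0],       # true label, required in reference
    "partition": "reference",
})

# The analysis partition has the same columns except the true label.
analysis = reference.drop(columns=["work_home_actual"]).assign(partition="analysis")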

Before fitting the estimator, we need to extract the metadata and declare the target column name:

import nannyml as nml

metadata = nml.extract_metadata(reference)
metadata.target_column_name = 'work_home_actual'
Fig 3. What the metadata looks like. NannyML will automatically detect some of the fields, but you might want to double-check them here. If you have separate reference and analysis datasets, you need an extra column with a categorical partition name.

And now, let's fit the confidence-based performance estimator!

cbpe = nml.CBPE(model_metadata=metadata, chunk_period="D")
cbpe.fit(reference_data=reference)
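After fitting on the reference partition, the estimator can be applied to the analysis partition. The estimate call below follows the same metadata-based API used above; the plotting call is my assumption of how the results are visualised and may differ between NannyML versions.

# Estimate performance on the analysis partition, where the true label is missing
estimated_performance = cbpe.estimate(analysis)

# The results object can be plotted; the exact call may vary across versions
figure = estimated_performance.plot(kind='performance')
figure.show()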

Data chunks. We create chunks based on time intervals, but data chunks need to be understood in terms of the business case: NannyML will chunk the reference partition automatically if no chunking is specified. Please pay attention to this, as it might influence the performance estimator.
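For reference, chunking can also be defined by row count or by a fixed number of chunks instead of a time period; the keyword arguments below follow NannyML's chunking options, again assuming the same metadata-based API as above.

# Alternative chunking strategies (pick one)
cbpe_daily = nml.CBPE(model_metadata=metadata, chunk_period="D")  # time-based chunks
cbpe_sized = nml.CBPE(model_metadata=metadata, chunk_size=5000)   # fixed rows per chunk
cbpe_count = nml.CBPE(model_metadata=metadata, chunk_number=10)   # fixed number of chunks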

Fig 4. [2] Call the confidence-based performance estimator and interpret model decay, both in timing and against the performance threshold. The graph represents the estimated ROC AUC over time within an acceptable threshold range. The blue and purple lines refer to the reference (access to the label) and analysis (no access to the label) partitions. Regarding the performance threshold, and given that a good ROC AUC is near 1, it makes sense that the acceptable range lies close to it. The model shows decay starting in late 2019.

At this point, we might know that performance drops in 2019, but how do we know what happened?

Fig 5. [2] The framework plots the distribution of continuous and categorical features in the reference and analysis partitions and can print alerts for data drift in a dataset. Here we can see a change in the distribution of the work-from-home-previous-day feature and of the distance from the office. The first figure shows that there has been a change in how far workers live from the office, tending to live farther away, causing model decay. The second figure shows the boolean distribution of the work-from-home-previous-day feature.
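If you want a quick, manual look at one feature before reaching for the library's plots, a rough sketch like the one below (reusing the hypothetical dataframes from the earlier layout example) compares a single feature's distribution across partitions; NannyML's own drift plots and alerts are far richer.

import matplotlib.pyplot as plt

# Rough, manual comparison of one feature across partitions
fig, ax = plt.subplots()
reference["distance_from_office"].plot(kind="hist", bins=30, alpha=0.5, ax=ax, label="reference")
analysis["distance_from_office"].plot(kind="hist", bins=30, alpha=0.5, ax=ax, label="analysis")
ax.set_xlabel("distance_from_office")
ax.legend()
plt.show()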

We were also able to share a real business case, with real-world data, at the PyData Madrid meetup.

Fig 6. The business challenge example we worked with at the PyData Madrid meetup. We predict a subscription after a one-month trial for a mobile dessert-recipes app. There is a one-month time lag between the real label (y) and the predicted label (y_hat), so we lack the ground truth for a month. How can we know that our model is performing well without losing customers? The app includes four main activities, with one month of historical data and device and geolocation information about the user. The base binary classifier (xgboost), in which label 1 represents the positive class (the user converts) and label 0 that the user does not convert, was deployed into production, shows an accuracy of 0.766, and is the model we use for performance estimation.

Conclusions
Measuring the performance of machine learning models in production is hard, and being able to estimate performance in the absence of the actual labels with an open-source tool after deployment has a significant impact on how data scientists control machine learning in production.

[1] Thanks to the NannyML co-founders for their feedback and corrections!

[2] Figures from the NannyML docs.
