Intro to Post-deployment model performance
Once machine learning models are deployed, they face reality and its evolution over time. Performance degradation is the deterioration of a model's predictive quality after it goes into production. It is a complex and exciting challenge because it has a significant impact on the business problem, and it is silent.
Performance degradation is usually detected a posteriori, once we can compare the predictions (y_hat) with the true labels (y, or actuals). So, how do we estimate performance in the absence of the targets? With the open-source library NannyML.
Why do machine learning models fail?
A machine learning model is generally trained on historical, static data. However, the environment and key conditions change, and with them the data and the patterns it holds, so the model's predictions degrade over time. There are two main reasons why machine learning models fail:
- Data Drift: unpredictable changes in the distribution of the input data encoded in the features the model infers on. It may affect the fitted function or the accuracy of the classifier. While it can be easy to spot in small datasets, it becomes challenging when there are hundreds of features in large-scale datasets.
- Concept Drift: a change in the decision boundary, i.e., in the overall relationship between the inputs and the target. The classifier captures a snapshot of that relationship, but as the concept changes, the snapshot becomes less and less accurate. In other words, the behaviour of the classifier does not change, but the underlying pattern does, and the snapshot is no longer a good representation of reality. [1]
The main difference between the two is that data drift is a change in the statistical properties of the independent variables (the features), while concept drift is a change in the relationship between the inputs and the outputs learned by the classifier, or a major change in the statistical properties of the dependent variable. NannyML's performance estimation focuses on detecting data drift and multivariate data drift.
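To make the data-drift idea concrete, here is a minimal, library-agnostic sketch (an illustration, not the NannyML API; the function name and the 0.05 threshold are assumptions) that flags per-feature drift between a reference and an analysis sample with a two-sample Kolmogorov-Smirnov test:

from scipy import stats

def detect_univariate_drift(reference_df, analysis_df, features, alpha=0.05):
    # Flag features whose distribution differs between reference and analysis data
    drifted = {}
    for feature in features:
        statistic, p_value = stats.ks_2samp(reference_df[feature], analysis_df[feature])
        drifted[feature] = {"ks_statistic": statistic, "p_value": p_value, "drift": p_value < alpha}
    return drifted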
How does it work in NannyML?
In real-world problems, there is usually a time lag between the prediction and the true label, or the label may never arrive at all. So the very high-level idea is to base the estimation on the model's predicted probabilities and their decision thresholds, and to compare data from the reference and analysis partitions to detect data drift.
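To ground the intuition, here is a simplified sketch of the confidence-based idea (an illustration only, not NannyML's actual implementation; the function name is made up): if the predicted probabilities are well calibrated, the expected accuracy of a batch of predictions can be estimated without any true labels.

import numpy as np

def estimate_expected_accuracy(predicted_proba, threshold=0.5):
    # With calibrated scores, the probability that each prediction is correct
    # is p for predicted positives and 1 - p for predicted negatives
    predicted_proba = np.asarray(predicted_proba)
    predicted_label = (predicted_proba >= threshold).astype(int)
    prob_correct = np.where(predicted_label == 1, predicted_proba, 1 - predicted_proba)
    return prob_correct.mean()

print(estimate_expected_accuracy([0.95, 0.10, 0.80, 0.30]))  # ~0.84

The library offers a probability-based estimation algorithm under the hood. However, to run performance estimation in NannyML, we should take a few things into account: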
- Partitions: the dataset is split into a reference partition and an analysis partition. The reference partition sets the expectations and needs the true label (y); it represents a period in which we are sure the model performs well (if the model hasn't been deployed yet, the test set can serve as this baseline). The analysis partition doesn't need the real label: NannyML's confidence-based performance estimation runs on the analysis period.
The reference partition contains both the true label and the predicted probability, and we should be confident that it corresponds to good model performance. The analysis partition has only the predicted probability, not the real target label. Review the Data partitions documentation.
- Required columns: the score / predicted probability of the positive class (from scikit-learn's predict_proba or decision_function method, depending on the classifier) as a column in the dataset, a unique identifier for each observation, and, in this case, a timestamp at which the observation occurred (a minimal preparation sketch follows this list).
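As a minimal preparation sketch (the column names here are hypothetical and should match your own dataset), the two partitions could look like this; the reference frame carries the true label, while the analysis frame does not.

import pandas as pd

reference = pd.DataFrame({
    "identifier": [1, 2, 3],                                               # unique id per observation
    "timestamp": pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-03"]),
    "y_pred_proba": [0.92, 0.15, 0.77],                                    # positive-class score
    "y_pred": [1, 0, 1],                                                   # predicted label
    "work_home_actual": [1, 0, 1],                                         # true label, required here
})

analysis = pd.DataFrame({
    "identifier": [4, 5, 6],
    "timestamp": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03"]),
    "y_pred_proba": [0.55, 0.48, 0.61],
    "y_pred": [1, 0, 1],
    # no true label column: this is the period whose performance we estimate
})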
Before fitting the estimator, we need to extract the metadata and declare the target column name:
import nannyml as nml

metadata = nml.extract_metadata(reference)
metadata.target_column_name = 'work_home_actual'
And now, let's fit the Confidence-Based Performance Estimator (CBPE)!
cbpe = nml.CBPE(model_metadata=metadata, chunk_period="D")
cbpe.fit(reference_data=reference)
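With the estimator fitted on the reference partition, the next step is to estimate performance on the analysis partition, where the true label is unavailable; this sketch assumes the same (older) NannyML API used in the snippet above.

estimated_performance = cbpe.estimate(analysis)
# The returned object holds the estimated metric per chunk; see the NannyML docs
# for how to inspect or plot it in your version of the library.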
Data chunks. Here we create chunks based on time intervals, but chunking should be chosen according to the business case: NannyML will chunk the reference partition automatically if nothing is specified. Pay attention to this, as it can influence the performance estimate.
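As a sketch of alternative chunking strategies (chunk_period is the one used above; the availability and exact names of the other parameters depend on your NannyML version):

cbpe_daily = nml.CBPE(model_metadata=metadata, chunk_period="D")   # time-based chunks
cbpe_sized = nml.CBPE(model_metadata=metadata, chunk_size=5000)    # chunks of a fixed number of rows
cbpe_count = nml.CBPE(model_metadata=metadata, chunk_number=10)    # a fixed number of chunks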
At this point, we might see that performance drops in 2019, but how do we know what happened?
We shared a real business case, with real-world data and a real-world challenge, at the PyData Madrid meetup.
Conclusions
Measuring the performance of machine learning models in production is hard, and being able to estimate it in the absence of actual labels with an open-source tool has a significant impact on how data scientists monitor and control machine learning after deployment.
[1] Thanks to a NannyML co-founder for the feedback and correction!
[2] Figures from the NannyML docs.