A new approach to guarantee 0 down-time in production ML models.
ML models are static, but data is dynamic and constantly changing.
It’s just a matter of time before the model is no longer relevant, or even worse — starts making costly predictions.
Why is Data Drift so dangerous?
Let’s look at a simple example:
Suppose we have a visual inspection station taking pictures of camera lenses right after cutting and polishing the glass. Images from several lenses are used to create an unsupervised model (see how we do it) to screen defective lenses before being assembled into the cameras and shipped to customers.
Let’s even suppose the model is good, even though there are a tiny number of defects, and they are very hard to find. The model nonetheless can screen a very high percentage of all the defects while rejecting a minimal number of good lenses.
Unfortunately, the optical inspection machine is prone to real-world problems and gradually slips out of calibration. Over time, its images become less and less focused, darker, and contain more contaminants.
At some point, the images are so different from the original images that the model spits out wrong predictions, and faulty cameras are shipped to customers.
This scenario is common; on average, data drifts such as this can happen on a monthly or even weekly basis.
If the model breaks once a month, it is unusable for any real-world business application.
Let’s take a look at a few common ways to address drift:
Detect, and then… do something
A good practice is to use a drift detector.
Some use open-source drift detectors, some develop in-house solutions, and some use 3rd party solutions (MLOps is a booming industry).
All options serve the same basic functionality of catching and alerting about possible drifts. It is then up to the customer to decide if and how to react.
However, this option requires a lot of ML/AI expertise, a real-time monitoring service, and a lot of DevOps work.
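To make the idea of a drift detector concrete, here is a minimal sketch, assuming drift is measured on a single numeric feature with the two-sample Kolmogorov-Smirnov statistic. The function names, the brightness readings, and the 0.3 threshold are all hypothetical and chosen only for illustration; real detectors typically monitor many features and calibrate their thresholds.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def has_drifted(reference, current, threshold=0.3):
    """Flag drift when the KS statistic exceeds a (hypothetical) threshold."""
    return ks_statistic(reference, current) > threshold


# Hypothetical image-brightness readings: training time vs. a later, darker batch.
training_brightness = [0.70, 0.72, 0.68, 0.71, 0.69, 0.73, 0.70, 0.71]
recent_brightness = [0.55, 0.58, 0.52, 0.57, 0.54, 0.56, 0.53, 0.55]

print(has_drifted(training_brightness, training_brightness))  # same data
print(has_drifted(training_brightness, recent_brightness))    # darker batch
```

Note that the detector only raises a flag. Deciding what to do about it is still left to whoever receives the alert.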
Periodically retrain a new model
A better practice is to retrain a model blindly every period.
For example, a customer can choose to retrain a model once a month based on that month’s data.
There are, however, some significant drawbacks to this approach:
- How can we know the correct period? Too short, and we may never reach a good model; too long, and the accumulated errors of the imperfect model might incur too high a price.
- How can we know we’ve collected enough new data during this period? Without enough data, a new model may not be possible to train.
- Do we have a data collection and management system in place?
It’s important to note that while this approach is automatic, it does not prevent the losses incurred during the period in which the data has already drifted but a new model has not yet been introduced.
Periodically update the existing model
In some cases, it is possible to train a new model starting from an existing model rather than from scratch.
New data is used to essentially extend the training phase of the model, on the assumption that the existing model is ‘close enough.’
This approach solves the need to tackle the ‘how much data do we need’ question; however, it amplifies the data engineering efforts required to collect, manage, train, evaluate, and push to production improved models.
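As a rough illustration of updating an existing model instead of starting over, the sketch below fits a one-dimensional linear model with plain gradient descent and then continues training from the old parameters on drifted data. Everything here (the `train` helper, the toy data, the learning rate, and the epoch counts) is an assumption made for the example, not a description of any particular product.

```python
def mse(data, w, b):
    """Mean squared error of the line y = w*x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(data, w=0.0, b=0.0, epochs=50, lr=0.1):
    """Full-batch gradient descent, starting from the given (w, b)."""
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


xs = [0.0, 0.25, 0.5, 0.75, 1.0]
old_data = [(x, 2 * x + 1.0) for x in xs]  # original relationship
new_data = [(x, 2 * x + 1.5) for x in xs]  # same slope, shifted offset (drift)

old_model = train(old_data, epochs=2000)          # long initial training
updated = train(new_data, *old_model, epochs=50)  # short warm-start update
scratch = train(new_data, epochs=50)              # same budget, from scratch

# Errors on the drifted data: old model, updated model, from-scratch model.
print(mse(new_data, *old_model), mse(new_data, *updated), mse(new_data, *scratch))
```

In this toy setup the warm-started model ends up closer to the drifted data than a from-scratch model given the same small training budget, which is the intuition behind the ‘close enough’ starting point. Whether that holds in practice depends on how far the data has drifted.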
Give new data less weight
In this approach, new data is given lesser weight than historical data to make predictions.
This approach is only suitable for scenarios where the data is time-dependent (like time-series, NLP, etc.) and for business use-cases where units are interconnected.
A new model to work with the existing one
The existing model remains static in this approach, while a new model uses the latest data to correct the drift.
The two models are then used in tandem to make predictions: the former makes the initial prediction, and the latter corrects its errors.
The essential advantage of this approach is that changes are limited to a small discrete area of the solution. However, the major drawback is the amount of engineering required and a limited capacity to verify the correcting model’s performance.
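One simple way to picture the tandem setup is a static base model plus a small corrector fitted on the base model's recent residuals. The sketch below assumes the simplest possible corrector, a constant offset; the function names and the data points are hypothetical.

```python
def base_model(x):
    """Static model fitted before the drift (hypothetical: y = 2x)."""
    return 2.0 * x


def fit_corrector(recent):
    """Fit a constant correction: the mean residual of the base model
    on recent (x, y) pairs collected after the drift."""
    residuals = [y - base_model(x) for x, y in recent]
    offset = sum(residuals) / len(residuals)
    return lambda x: offset


def predict(x, corrector):
    """Tandem prediction: the base model's output plus the correction."""
    return base_model(x) + corrector(x)


# Hypothetical post-drift measurements: every target is shifted up by 0.5.
recent = [(1.0, 2.5), (2.0, 4.5), (3.0, 6.5)]
corrector = fit_corrector(recent)

print(base_model(4.0))          # 8.0, the uncorrected prediction
print(predict(4.0, corrector))  # 8.5, after the correction
```

The base model is never touched, so the change stays confined to the corrector. The flip side, as noted above, is that the corrector itself still has to be validated somehow.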
A new approach – Self-wiring networks
Self-wiring networks are neural networks that construct themselves based on the data and target provided. This is in contrast to traditional methods where the network’s architecture has been set (like ResNet50, for example), and only the weights are fitted based on training data.
Several techniques and fields of study in this area have been gaining a lot of traction over the past few years, most notably neural architecture search, or NAS (a good overview here).
Our particular method consists of a proprietary library of computational building blocks. These are based on a mix of collective experience in the subject matter and in-house IP and can range from simple mathematical functions (like ReLU) to complex mini-networks.
A general outline of the training process is:
1. For each new layer:
- Select best building blocks
- Select # of building blocks
- Select best interconnections
2. Decide if a new layer is needed
3. Update the solver with results of the previous layer
4. Repeat until the target performance is achieved or there are no more gains
An example:
It’s interesting to note that:
- The number of layers is not set in advance and will be whatever best fits the data
- At each layer, the number and type of node can change
- The network never used f5, and that’s ok
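The training outline above can be pictured with a deliberately tiny toy. The sketch below is not the proprietary method: it assumes a library of four scalar building blocks, allows only one block per layer (so it does not choose the number of blocks or their interconnections), and greedily adds layers while the error keeps dropping. All names and data are hypothetical.

```python
# Hypothetical library of building blocks (simple scalar functions).
BLOCKS = {
    "identity": lambda v: v,
    "relu": lambda v: max(0.0, v),
    "square": lambda v: v * v,
    "negate": lambda v: -v,
}


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def grow_network(xs, ys, max_layers=5, min_gain=1e-9):
    """Greedily stack one block per layer until no block improves the error."""
    layers, hidden = [], list(xs)
    best_error = mse(hidden, ys)
    for _ in range(max_layers):
        # Score every block as a candidate for the next layer.
        scored = {name: mse([f(h) for h in hidden], ys)
                  for name, f in BLOCKS.items()}
        name = min(scored, key=scored.get)
        if best_error - scored[name] <= min_gain:
            break  # a new layer is not needed: no more gains
        layers.append(name)
        hidden = [BLOCKS[name](h) for h in hidden]
        best_error = scored[name]
    return layers, best_error


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [0.0, 0.0, 0.0, 1.0, 4.0]  # target: relu(x) squared

layers, error = grow_network(xs, ys)
print(layers, error)  # ['relu', 'square'] 0.0
```

Even in this toy, the depth is discovered rather than fixed, and some blocks in the library are never used.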
Real-time network adaptation
Let’s explore two possible types of drift and how the adaptive AI handles them in real time with 0 down-time.
Drift #1 — a feature is dropped.
Removing data or measurements is common in many manufacturing and testing environments where the engineering team has decided to remove a test from the testing plan.
Drift #2 — a feature changes dramatically in character
Distribution changes are common in many manufacturing and testing environments where measurement machines slip out of calibration, changes are made to the manufactured product itself, or changes are made to the materials used to build the product.
In both cases, the adaptive AI follows the same outline:
- The raw data changes its characteristic without notice
- The online monitor picks up the change in performance
- The network identifies the most affected nodes
- The network fits a repair in real-time
It’s important to note that:
- The network has decided to keep the original number of layers but drop the first node of the second layer altogether
- The compound effect of dropping a feature has impacted only the first two layers
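The "online monitor picks up the change in performance" step of the outline can be sketched as a rolling accuracy check. This is only an assumption about how such a monitor might look: the class name, the window size, and the accuracy threshold are hypothetical, and it presumes that true labels eventually arrive for recent predictions.

```python
from collections import deque


class PerformanceMonitor:
    """Tracks accuracy over the most recent labelled predictions."""

    def __init__(self, window=50, min_accuracy=0.9):
        self.outcomes = deque(maxlen=window)
        self.min_accuracy = min_accuracy

    def update(self, prediction, label):
        """Record one labelled prediction; return True if accuracy over
        a full window has dropped below the threshold."""
        self.outcomes.append(prediction == label)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        accuracy = sum(self.outcomes) / len(self.outcomes)
        return accuracy < self.min_accuracy


monitor = PerformanceMonitor(window=10, min_accuracy=0.8)
for _ in range(10):
    monitor.update("OK", "OK")  # model is correct before the drift

# After the drift, the model starts missing defects.
alerts = [monitor.update("OK", "Not OK") for _ in range(3)]
print(alerts)  # [False, False, True]
```

In this sketch the alert fires on the third miss, when windowed accuracy falls to 0.7. What happens next (identifying the affected nodes and fitting a repair) is outside the scope of the example.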
Why does it solve data drift?
Let’s reconsider the 4 modes of data drift:
- Sudden drift: a new concept occurs within a short time. When this happens, a dynamic model will automatically notice the change and retrain itself with the new target.
- Gradual drift: a new concept gradually replaces an old one over some time. When this happens, a dynamic model will automatically notice the difference and identify the right time to introduce a new mode.
- Incremental drift: an old concept incrementally changes to a new concept over some time. A dynamic model will continuously retrain itself, tracking and following the new concept.
- Recurring concepts: an old concept may reoccur after some time. A dynamic model will switch target concepts in real time, toggling between states without losing performance.
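To make the difference between the four modes more tangible, here is a small simulation sketch. It assumes each concept can be summarized by a single number (0.0 for the old concept, 1.0 for the new one); the function names and the `start`, `width`, and `period` parameters are hypothetical.

```python
import random


def new_concept_share(kind, t, start=100, width=50, period=100):
    """Share (0..1) of the new concept in the data at time step t."""
    if kind == "sudden":  # switches all at once
        return 1.0 if t >= start else 0.0
    if kind in ("gradual", "incremental"):  # ramps up over `width` steps
        return min(1.0, max(0.0, (t - start) / width))
    if kind == "recurring":  # alternates every `period` steps
        return float((t // period) % 2)
    raise ValueError(f"unknown drift kind: {kind}")


def sample_concept(kind, t, rng, old=0.0, new=1.0):
    """Concept value that generates the sample at time step t."""
    share = new_concept_share(kind, t)
    if kind == "gradual":
        # Samples come from either the old or the new concept,
        # with the new one becoming more and more likely.
        return new if rng.random() < share else old
    # Otherwise the concept itself sits between old and new.
    return old + share * (new - old)


rng = random.Random(0)
for kind in ("sudden", "gradual", "incremental", "recurring"):
    print(kind, [sample_concept(kind, t, rng) for t in (50, 110, 125, 160)])
```

The key contrast is between gradual and incremental drift: in the gradual case individual samples flip between the two concepts, while in the incremental case the concept itself slides from one to the other.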
Let’s review a real example using real-world data.
The data is a tabular dataset containing 208 features, a label, a timestamp, and an indexer:
Using Vanti, we trained a model to classify OK / Not OK :
The report above gives a high-level summary of the trained model’s performance. For a more technical view of the results, the confusion matrix is also provided:
We can see that the model is pretty good, and for the business use case it’s intended for, it can generate significant improvements in yield and bottom-line profit for the customer.
To analyze the impact of dropping a feature on performance, we will use a new batch, provided by the customer, that contains drifted features and has V53 removed (a feature that Vanti’s report indicated to be very important). We compare the following models on this batch:
- Old model
- A new model trained only on drifted data
- Refined model
- Old model with correction
In the real world, data is dynamic, while traditional ML models are static.
Dynamic data leads to out-of-date models that might cause harm, so it is a must to notice and react to these changes.
Our proposed method is the only method that guarantees both 0 down-time and the best possible performance.