Unforeseen data shifts happening at any moment

In the realm of Machine Learning Operations (MLOps) and Model Operations (ModelOps), data drift poses a significant challenge. This article delves into the intricacies of data drift, its sources, and strategies for detection and management, particularly in an industrial context.

Data drift, which in unsupervised model settings often surfaces as an outlier or anomaly, can be a non-trivial problem even in the simplest one-dimensional setting. Correctly identifying the onset and nature of a drift often requires more advanced techniques, such as signal processing or merging the monitored data with contextual data.

Monitoring a logging stream with fixed-time indicators is an intuitive way to catch temporal shifts. However, a monitoring system that detects a drift may also send a wrong recommendation signal to the core process layer, leading to misjudgments and bad recommendations.
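As a rough illustration of a fixed-window indicator, the sketch below compares the mean of each fixed-size window against the first window, which is treated as the reference. The function name `window_shift_flags`, the window size, and the z-score threshold are assumptions made for this example, not values taken from any particular monitoring system.

```python
from statistics import mean, stdev


def window_shift_flags(values, window=20, z_threshold=3.0):
    """Compare each fixed-size window's mean against the first (reference) window.

    Returns a list of (window_start_index, z_score, flagged) tuples.
    """
    reference = values[:window]
    ref_mean = mean(reference)
    ref_std = stdev(reference)
    if ref_std == 0:
        raise ValueError("reference window has zero variance")
    # Standard error of a window mean under the reference distribution.
    std_err = ref_std / (window ** 0.5)

    results = []
    for start in range(window, len(values) - window + 1, window):
        current = values[start:start + window]
        z = (mean(current) - ref_mean) / std_err
        results.append((start, z, abs(z) > z_threshold))
    return results


# Hypothetical stream: 40 stable readings, then 20 readings shifted up by 0.5.
stable = [10.0 + ((i % 5) - 2) * 0.1 for i in range(40)]
shifted = [10.5 + ((i % 5) - 2) * 0.1 for i in range(20)]

for start, z, flagged in window_shift_flags(stable + shifted):
    print(f"window starting at {start}: z={z:.1f}, flagged={flagged}")
```

A flag from a check like this is only a signal that something changed. Whether it should trigger a recommendation to the core process layer is a separate decision, which is where the misjudgments described above can creep in.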

It's important to note that once data drift occurs, a model that was successful is only known to have been successful at a single point in time. Peak shifts inside a period are a more subtle form of data drift: even a slight phase delay in time-series data can cause an ML model to generate incorrect predictions. Simple statistical properties such as the mean and variance may not catch this kind of drift, because a delayed periodic signal can have the same mean and variance as the original.
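To make the phase-delay case concrete, the following sketch builds a synthetic periodic signal and a copy delayed by three samples. Their mean and variance are practically identical, yet a simple lagged-correlation search recovers the delay. The signals, the period of 24 samples, and the helper `best_lag` are illustrative assumptions.

```python
import math
from statistics import mean, pvariance


def best_lag(reference, observed, max_lag):
    """Return the lag (in samples) at which observed best matches reference.

    Each candidate lag is scored by the mean product of the overlapping,
    mean-centered samples. A positive result means observed is delayed.
    """
    ref_mean = mean(reference)
    obs_mean = mean(observed)
    ref_c = [x - ref_mean for x in reference]
    obs_c = [x - obs_mean for x in observed]
    n = len(ref_c)

    def score(lag):
        products = [ref_c[i] * obs_c[i + lag] for i in range(n) if 0 <= i + lag < n]
        return sum(products) / len(products)

    return max(range(-max_lag, max_lag + 1), key=score)


period = 24  # assumed samples per cycle
reference = [math.sin(2 * math.pi * i / period) for i in range(240)]
delayed = [math.sin(2 * math.pi * (i - 3) / period) for i in range(240)]

# Summary statistics barely move even though every sample is shifted in time.
print("mean:", round(mean(reference), 6), round(mean(delayed), 6))
print("variance:", round(pvariance(reference), 6), round(pvariance(delayed), 6))
print("estimated lag:", best_lag(reference, delayed, max_lag=6))
```

The search range `max_lag` has to stay well below the period, otherwise lags that differ by a full cycle become indistinguishable.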

Not-so-obvious sources of data drift in an industrial context include expansion to new regions or facilities, sensor and equipment upgrades or protocol changes, operational adjustments and maintenance activities, hidden schema drift, data volume and velocity changes, integration and synchronization errors in OT/IT bridging, and downstream processing assumptions and thresholds.

These sources of drift are subtle because they emerge from process, integration, or operational changes rather than obvious raw data corruption or simplistic schema modifications. Detecting and managing them requires proactive monitoring, context-aware validation, and collaboration between data engineering, OT, and analytics teams.
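One small piece of context-aware validation is checking incoming records against the schema that downstream code expects. The sketch below is a minimal version of that idea; the field names, the expected types, and the function `schema_diff` are hypothetical and only stand in for whatever contract a real pipeline defines.

```python
def schema_diff(expected, record):
    """Compare one record against an expected {field: type} mapping.

    Returns the fields that are missing, unexpected, or of the wrong type.
    """
    missing = sorted(f for f in expected if f not in record)
    unexpected = sorted(f for f in record if f not in expected)
    type_changed = sorted(
        f for f, expected_type in expected.items()
        if f in record and not isinstance(record[f], expected_type)
    )
    return {"missing": missing, "unexpected": unexpected, "type_changed": type_changed}


# Hypothetical contract for a temperature sensor feed.
expected = {"sensor_id": str, "temperature_c": float, "timestamp": str}

# Hypothetical record after an equipment upgrade: the unit changed and the
# timestamp switched from an ISO string to an epoch integer.
record = {"sensor_id": "A1", "temperature_f": 98.6, "timestamp": 1700000000}

print(schema_diff(expected, record))
```

A check like this catches renamed fields and type changes. It does not catch a unit change that keeps the same field name, which is why the collaboration between teams mentioned above still matters.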

In industrial or manufacturing scenarios, process recipes and settings change, so correctly identifying data drift requires monitoring contextual data as well. Data drift is not necessarily permanent; it can be short-lived. Anchor events, and the distance of the peaks relative to those anchors, can indicate a drift. In time-series data, drift can manifest as a level shift or as a variance shift, including a decrease in variance.
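The anchor idea can be sketched as follows, assuming each process cycle starts at a known anchor index (for example, a logged "cycle start" event) and has a fixed length. The function `peak_offsets`, the cycle length, and the tolerance are illustrative choices, and the first cycle is arbitrarily taken as the baseline.

```python
def peak_offsets(values, anchors, cycle_length):
    """For each anchor index, return the distance from the anchor to the
    largest value within the following cycle_length samples."""
    offsets = []
    for anchor in anchors:
        cycle = values[anchor:anchor + cycle_length]
        offsets.append(cycle.index(max(cycle)))
    return offsets


def make_cycle(peak_at, length=10):
    """Build one synthetic cycle with a single peak at the given offset."""
    return [1.0 if j == peak_at else 0.1 for j in range(length)]


# Hypothetical data: the peak sits 4 samples after the anchor in the first
# two cycles and 6 samples after it in the last two.
values = make_cycle(4) + make_cycle(4) + make_cycle(6) + make_cycle(6)
anchors = [0, 10, 20, 30]

offsets = peak_offsets(values, anchors, cycle_length=10)
baseline = offsets[0]
tolerance = 1  # assumed acceptable deviation, in samples
flagged = [abs(offset - baseline) > tolerance for offset in offsets]

print("offsets:", offsets)
print("flagged:", flagged)
```

Because the comparison is made relative to each anchor, a change in cycle start times alone does not raise a flag; only a change in where the peak falls inside the cycle does.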

Spectral analysis can reveal a drift as a shift in the frequency content of the data, and it is important for the explainable AI (xAI) field as well. Sensor drift, where the sensor that measures the incoming data and feeds the ML model or monitoring system is itself drifting, can be hard to detect and manage.
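A minimal sketch of a spectral check is to compare the dominant period of two windows of data. The version below uses a naive discrete Fourier transform to stay dependency-free; in practice a library FFT would normally replace the inner loop. The function `dominant_period` and the synthetic signals are assumptions for illustration.

```python
import cmath
import math


def dominant_period(values):
    """Return the period (in samples) of the strongest non-DC frequency
    component, using a naive O(n^2) discrete Fourier transform."""
    n = len(values)
    average = sum(values) / n
    centered = [v - average for v in values]

    best_k, best_power = 1, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            c * cmath.exp(-2j * math.pi * k * i / n)
            for i, c in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k


# Hypothetical windows: the cycle shortens from 12 samples to 8 samples.
before = [math.sin(2 * math.pi * i / 12) for i in range(120)]
after = [math.sin(2 * math.pi * i / 8) for i in range(120)]

print("dominant period before:", dominant_period(before))
print("dominant period after:", dominant_period(after))
```

Both windows here have the same amplitude, so a level or variance check would see no difference between them; the change only shows up in the frequency domain.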

Monitoring data drift is essential to keeping any deployed ML model successful. By understanding the sources of data drift and implementing robust monitoring strategies, we can help ensure that our ML models remain effective over time.


In short, data and cloud technology plays an important role in handling data drift in MLOps and ModelOps. Signal processing and contextual data, applied to monitored logging streams, help detect and manage drift, and that monitoring is integral to keeping deployed models accurate over the long term.