POSTED : March 17, 2021
BY : Raj Nair
Hear me out, maybe just maybe, I might win some of you over.
There is no taking away from the fact that building machine learning and AI algorithms is a hard and challenging endeavor. However, in the grand scheme of things, the larger ecosystem around these algorithms is critical to the success of the effort. In a paper from Google researchers titled “Hidden Technical Debt in Machine Learning Systems,” the authors presented the following visual:
That’s right: somewhat along the lines of “missing the forest for the trees,” model development is a small piece of a larger puzzle. The success of enterprise data science depends on all of the other pieces of the puzzle working together. From my experience, there are some key areas that require attention to make data science successful in enterprises:
Notebooks are great for data scientists to collaborate and explore, but they don’t represent software best practices, not yet at least. Modularization, unit testing, bootstrap code, frameworks, and identical environments are all important considerations for robust and scalable ML development and deployment.
Model development cannot succeed in isolation, and model deployment is not independent of IT infrastructure. ModelOps is therefore gaining popularity in production ML environments as a way to address the “last mile” issues of model deployment and consumption. ModelOps refers to the tools and processes used to operationalize and manage all models running in production systems (statistical, ML, and AI models). These tools and processes strive to provide dashboards, reporting, and information for stakeholders. One important aspect of ModelOps, specifically for machine learning models, is the comprehensive monitoring of ML pipelines, which we will cover in some detail below.
The reality on the ground is that ML lifecycles are long and complex. Once you have what you might call a “candidate algorithm,” something that has performed well on the training dataset and meets the initial requirements expected of it, you have to build a production pipeline environment that includes data integration pipelines, access to production datasets, robust transformation logic, feature stores, feedback loops, server and storage infrastructure, high availability, and load balancing. And it’s not enough to build this pipeline. You may have contractual SLAs that you need to maintain and report on, so building metrics capabilities all along the pipeline is critical. In general, metrics can be classified into model metrics and operational metrics.
Model metrics capture important performance characteristics of the model. More importantly, metrics are needed to understand and monitor two important concepts that are unique to ML pipelines: model decay and drift. Depending on the type of algorithm, examples of metrics include:
Model decay is a term used to capture the reality that models degrade over time. The reasons for that are closely associated with our next concept, “drift.” At a high level, there are two types of drift: concept drift and data drift. Concept drift refers to a change in the relationship between the inputs and outputs of the model. For example, customer buying behavior may have changed because of economic factors or because completely new products were introduced in the market (remember how phones were not just phones when the iPhone debuted?).
Data drift, on the other hand, is when the distribution of the input data changes. Say you are a bank trying to offer personalized service. Maybe your training data was based largely on branch walk-ins, with only a small amount of data from online channels. During the launch, the pandemic hit, and a very large amount of data came in through online banking. In the training phase, the model hadn’t actually done so well with the online data, so the shift in the data mix exposes that weakness in production.
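To make the bank example concrete, here is a minimal sketch of one common way to quantify data drift on a categorical feature: the Population Stability Index (PSI). The function names, the sample channel counts, and the 0.25 alert threshold are my own assumptions for illustration (0.25 is a frequently quoted rule of thumb, not a standard), so treat this as a starting point rather than a definitive implementation.

```python
import math
from collections import Counter


def category_shares(records):
    """Return the fraction of records that fall into each category."""
    counts = Counter(records)
    total = len(records)
    return {category: n / total for category, n in counts.items()}


def population_stability_index(expected, actual, eps=1e-6):
    """PSI between a reference sample and a new sample of a categorical feature.

    `eps` is a small floor (an assumption of this sketch) that avoids log(0)
    when a category is missing from one of the two samples.
    """
    expected_shares = category_shares(expected)
    actual_shares = category_shares(actual)
    psi = 0.0
    for category in set(expected_shares) | set(actual_shares):
        pe = max(expected_shares.get(category, 0.0), eps)
        pa = max(actual_shares.get(category, 0.0), eps)
        psi += (pa - pe) * math.log(pa / pe)
    return psi


# Hypothetical channel mix: mostly branch walk-ins in training,
# mostly online traffic after launch.
training_channels = ["branch"] * 90 + ["online"] * 10
production_channels = ["branch"] * 30 + ["online"] * 70

PSI_ALERT_THRESHOLD = 0.25  # assumed rule-of-thumb cutoff, tune for your data

psi = population_stability_index(training_channels, production_channels)
drift_detected = psi > PSI_ALERT_THRESHOLD
print(f"PSI = {psi:.3f}, drift detected: {drift_detected}")
```

For numeric features the same idea applies after binning the values; the comparison is always between a reference window (typically the training data) and a recent production window.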
Operational Metrics are another important aspect. Examples include:
Capturing metrics is great, but taking action based on them is crucial. This brings us to the topic of model management.
Model management covers the following important aspects of a model deployed in a production pipeline:
I am going to go a bit deeper on two of these: drift management and data quality management.
Let me use this cliché one more time: “garbage in, garbage out.” Data quality needs to be addressed early on and in different parts of the pipeline. Data quality is probably the single largest determinant of good model performance.
Quality of input data: Missing values, null fields, noise, bad labels—all of these influence model training.
Quality of features: Feature stores are beginning to play a more prominent role in ML quality and performance. Centrally managed features are important for scale and quality.
To this end, it may be time to explore the concept of “data unit tests”: expressing expectations about the data upfront, such as valid ranges and expected values, and checking incoming data against them.
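A data unit test can be as simple as a list of declared expectations that every incoming row is checked against. The sketch below is one hypothetical way to express that in plain Python; the `Expectation` class, the helper names, and the sample columns (`age`, `channel`) are all assumptions made for illustration, not the API of any particular data validation library.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Expectation:
    """One declared expectation about a single column."""
    column: str
    description: str
    check: Callable[[Any], bool]


def in_range(lo: float, hi: float) -> Callable[[Any], bool]:
    """Value must be a number between lo and hi, inclusive."""
    return lambda v: isinstance(v, (int, float)) and lo <= v <= hi


def one_of(*allowed: Any) -> Callable[[Any], bool]:
    """Value must be one of the allowed values (None is rejected unless listed)."""
    return lambda v: v in allowed


def validate(rows: List[Dict[str, Any]], expectations: List[Expectation]) -> List[str]:
    """Return a human-readable message for every violated expectation."""
    failures = []
    for index, row in enumerate(rows):
        for exp in expectations:
            if not exp.check(row.get(exp.column)):
                failures.append(f"row {index}: {exp.column} {exp.description}")
    return failures


# Hypothetical expectations for a customer dataset.
expectations = [
    Expectation("age", "must be between 0 and 120", in_range(0, 120)),
    Expectation("channel", "must be 'branch' or 'online'", one_of("branch", "online")),
]

rows = [
    {"age": 34, "channel": "online"},
    {"age": -5, "channel": "branch"},   # violates the age range
    {"age": 51, "channel": None},       # missing channel
]

failures = validate(rows, expectations)
for message in failures:
    print(message)
```

Running checks like these at ingestion time and again after feature transformation gives you a failure message close to where the bad data entered, instead of a silent drop in model quality later.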
We touched on the topic of drift earlier. Monitoring and managing drift is a key aspect of model management, and the metrics we capture can give us real insight into drift patterns. For instance, a change in the number of calls to a model may be an indicator that something is wrong. Visualizations of the distribution of features are also useful. Observing the distributions over time can help spot changes that may not be normal. For example, the range of values may have changed over time (maybe an error in downstream processing or maybe a real change). Or you may be able to spot the dreaded “flat-lining” case, where you expect a continuously changing value but, for an extended time frame, the value stays constant.
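Both of the checks described above, a changed value range and flat-lining, are easy to automate alongside the visualizations. Here is a minimal sketch; the function names, the sample readings, and the run-length limit of 4 are assumptions chosen for illustration and would need tuning for a real feature.

```python
def longest_flat_run(values, tolerance=0.0):
    """Length of the longest run of consecutive values that stay within
    `tolerance` of the first value in that run."""
    if not values:
        return 0
    longest = run = 1
    anchor = values[0]
    for value in values[1:]:
        if abs(value - anchor) <= tolerance:
            run += 1
        else:
            anchor = value
            run = 1
        longest = max(longest, run)
    return longest


def is_flat_lining(values, max_flat_run, tolerance=0.0):
    """True if the series stays constant for at least `max_flat_run` readings."""
    return longest_flat_run(values, tolerance) >= max_flat_run


def out_of_range_fraction(values, lo, hi):
    """Fraction of values that fall outside the reference range [lo, hi]."""
    if not values:
        return 0.0
    outside = sum(1 for value in values if value < lo or value > hi)
    return outside / len(values)


# Hypothetical feature readings over time.
reference = [20.1, 20.4, 19.8, 21.0, 20.7]
recent = [20.1, 20.4, 19.8, 21.0, 21.0, 21.0, 21.0, 21.0, 20.7]
shifted = [20.0, 25.0, 30.0, 20.5]

flat = is_flat_lining(recent, max_flat_run=4)
range_violations = out_of_range_fraction(shifted, min(reference), max(reference))
print(f"flat-lining: {flat}, share outside reference range: {range_violations:.2f}")
```

Neither check can tell you whether the change is an error or a real shift in the world; it only tells you where to look.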
So many factors influence the performance and scale of machine learning pipelines that the topic deserves an article of its own. To give you an idea, consider the following:
So, food for thought. I hope this article highlights the broader ecosystem around machine learning and why each part of it is important, even when it is not related to algorithm development. And just maybe, it has won some converts to my controversial title.
Raj Nair is VP of Intelligence and Analytics at Concentrix Catalyst. His work includes researching creative solutions to challenging data problems, crafting elegant approaches to scaling data science algorithms and building plug-ins for the big data integration ecosystem.