MLOps is a set of practices that includes Machine Learning, DevOps and Data Engineering elements. The main aim of the union is reliable and efficient deployment and maintenance of Machine Learning systems in production.
In other words, it is a way to turn machine learning methods and technologies into a useful tool for solving business problems.
It is important to understand that the production chain begins long before the development of the model. Its first step is to define a business problem, a hypothesis about the value that can be extracted from the data, and a business idea for applying it.
The very concept of MLOps arose as an analogy to the concept of DevOps in relation to models and technologies of machine learning. DevOps is an approach to software development that allows you to increase the speed of implementation of individual changes while maintaining flexibility and reliability using a number of approaches. These approaches include continuous development, dividing functions into a number of independent microservices, automated testing and deployment of individual changes, global health monitoring, rapid response system detected failures, etc.
DevOps defined the software lifecycle, and the idea came up in the community to use the same technique for big data. DataOps is an attempt to adapt and expand the methodology taking into account the peculiarities of storing, transferring and processing large amounts of data in a variety of and interacting with each other platforms.
With the emergence of a certain critical mass of machine learning models implemented in business processes of enterprises, a strong similarity between the life cycle of mathematical machine learning models and the software life cycle was noticed. The only difference is that the algorithms of the models are created using machine learning tools and methods. Therefore, the idea naturally arose to apply and adapt already known approaches to software development for machine learning models. Thus, the following key stages can be distinguished in the life cycle of machine learning models:
- defining a business idea;
- training the model;
- testing and implementation of the model in the business process;
- operation of the model.
When it becomes necessary to change or retrain the model on new data in the process of operation, the cycle is restarted - the model is refined, tested, and a new version is deployed.
In this life cycle, it seems logical to use DevOps tools: automated testing, deployment and monitoring, formalizing the calculation of models in the form of separate microservices. But there are also a number of features that prevent the direct use of these tools without additional ML strapping.
In this article, we will review MLOps best practises, MLOps Risks & Challenges and MLOps pros and cons that automate and shorten the machine learning cycle.
What is MLOps?
MLOps (combining machine learning technologies and processes and approaches to implementing the developed models in business processes) is a new way of cooperation between business representatives, scientists, mathematicians, machine learning specialists and IT engineers in creating artificial intelligence systems.
MLOps is a culture and set of practices for integrated and automated lifecycle management of machine learning systems, combining their development and operations maintenance, which includes integration, testing, release, deployment and infrastructure management. We can say that MLOps solution extends the CRISP-DM methodology with the help of an Agile approach and technical tools for automated operations with data, ML models, code and environment. Putting MLOps into practice is expected to help avoid common pitfalls and problems faced by Data Scientists working in accordance with the classic phases of CRISP-DM. We will also talk about other advantages that this concept gives to business further.
How does MLOps differ from DevOps and DataOps?
By analogy with DevOps and DataOps, business has a need to organize continuous cooperation and interaction between all participants in the processes of working with machine learning models from business to engineers and Big Data developers, including Data Scientists and ML specialists. This business need is connected with the popularization of Machine Learning methods and the growth of their practical implementations.
The concept of MLOps is still quite young. For the first time, the professional community spoke publicly about the need for integrated lifecycle management of machine learning in industrial operation (production) around 2018, after one of the Google presentations. Despite its relatively young age, the demand for MLOps is getting bigger and bigger every day.
In practice, the problem of introducing ML-models into a real business is not limited to data preparation, development and training of a neural network or other Machine Learning algorithm. The quality of a production solution is influenced by many factors, from verification of datasets to testing and deployment in a production environment in the form of a reliable Big Data application. This means that the actual results of prediction or classification depend not only on the neural network architecture and the machine learning method proposed by Data Scientist, but also on how the development team has implemented this model, and the administrators have deployed it in a clustered environment.
Also, the quality of the input data (Data Quality), sources, channels and the frequency of their receipt are important, which belongs to the area of responsibility of the data engineer. Organizational and technical obstacles in the interaction of multidisciplinary specialists involved in the development, testing, deployment and support of ML solutions lead to an increase in the time required to create a product and a decrease in its value for the business. To remove such barriers, the concept of MLOps was invented, which, like DevOps and DataOps, seeks to increase automation and improve the quality of industrial ML solutions, paying attention to regulatory requirements and business benefits.
Main Benefits of MLOps Solutions.
- Rapid innovation through robust machine learning lifecycle managemen
MLOps solution, or DevOps for machine learning, makes collaboration possible not only for data processing teams, but also for analysis professionals and IT engineers. It also increases the speed of model development and its deployment with the help of monitoring, validation and management systems for machine learning models.
- Creation of reproducible workflows and models.
The main MLOps benefits allow you to:
- reduce variation in model iterations and provide resiliency for enterprise-level scenarios with reproducible learning and models;
- use dataset registries and advanced model registries to track resources;
- provide improved traceability by tracking code, data, and metrics in the execution log;
- сreate machine learning pipelines to design, deploy, and administer reproducible model workflows for consistent model delivery.
- Easy deployment of high precision models in any location.
With the help of MLOPs solution you can:
- deploy high precision models quickly and confidently;
- use automatic scaling, managed clusters of CPUs and GPUs with distributed learning in the cloud;
- pack models quickly, ensuring high quality at every step through the use of profiling and model validation;
- use managed deployment to migrate models to the production environment.
- Effective management of the entire machine learning life cycle.
MLOPs benefits enable you to:
- use the built-in integration with Azure DevOps and GitHub actions to plan, automate, and manage workflows efficiently;
- streamline training and model deployment pipelines, use CI / CD to simplify retraining, and integrate machine learning easily into existing release processes;
- use advanced data bias analysis to improve model performance over time.
- Machine Learning Resource Management System and Control.
MLOPs benefits also help you to:
- keep track of version history and model origin to enable auditing;
- evaluate the importance of features and create more advanced models with minimal bias using uniform distribution metrics;
- set calculation quotas for resources and enforce policies to ensure compliance with security, privacy, and compliance standards;
- create audit trails to meet regulatory requirements as you mark machine learning resources and automatically trace experiments.
MLOps Risks & Challenges
A comprehensive answer to the fundamental question of business about how applicable ML-models are for solving problems and the general question of AI and MLOps adoption are key challenges in the development and implementation of MLOps approaches. Initially, one of the biggest MLOps disadvantages is that businesses are skeptical about introducing machine learning into processes as it is difficult to rely on models in places where people used to work in the past. For business, programs appear to be a "black box", the relevance of the answers to which has yet to be proven. Besides, in certain spheres, for example, in banking, telecom operators business and others, there are stringent requirements of state regulators. All systems and algorithms that are implemented in banking processes are audited. To solve this problem, to prove to business and regulators the validity and correctness of the responses of artificial intelligence, monitoring tools are introduced along with the model. In addition, there is an independent validation procedure required for regulatory models that meets the requirements of the Central Bank. An independent expert group audits the results obtained by the model, taking into account the input data.
The second challenge is the assessment and consideration of model risks when implementing a machine learning model. A well-known internet debate about whether the dress was white or blue showed that there might be cases when even people cannot answer a question with absolute certainty. Then it would only be natural that artificial intelligence also should have the right to make a mistake. It is also worth considering that data can change over time, and models need to be trained to produce a sufficiently accurate result. So that the business process does not suffer, it is necessary to manage model risks and monitor the model's performance, regularly retraining it on new data.
But after the first stage of mistrust, the opposite effect begins to appear. The more models are successfully introduced into processes, the more the business grows an appetite for the use of artificial intelligence - more and more new problems are found that can be solved by machine learning methods. Each task launches a whole process that requires certain competencies:
- data engineers prepare and process data;
- data scientists use machine learning tools and develop a model;
- IT implements the model into the system;
- ML engineer determines how to correctly integrate this model into the process, which IT tools to use depending on the requirements for the mode of application of the model, taking into account the flow of calls, response time, etc.
- ML architect designs how a software product can be physically implemented in an industrial system.
The whole cycle requires a large number of highly qualified specialists. At a certain point of development and the degree of penetration of ML-models into business processes, it turns out that it becomes expensive and inefficient to scale the number of specialists linearly in proportion to the increase in the number of tasks. Therefore, the question about “How to automate and launch an MLOps process?” arises - the definition of several standard classes of machine learning problems, the development of standard data processing pipelines and additional training of models. In the ideal picture, solving such problems requires professionals who are equally well versed in competencies at the intersection of BigData, Data Science, DevOps and IT. Therefore, the biggest problem in the data science industry and the biggest challenge in organizing MLOps processes is the lack of such competence in the existing training market. Specialists who meet such requirements are currently unique in the labor market and are worth their weight in gold.
MLOps Best Practices
One of the main concepts of Data Engineering is the data pipeline. A data pipeline is a cycle of transformations that are applied to data between its source and an end destination. They are usually explained as a graph in which each node is a transformation and edges represent dependencies or execution order.
We’ve already established that in order to be successful we need a mixed team that has that skill set. Most likely it would consist of a Data Scientist (ML Engineer), a Data Engineer and a DevOps Engineer.
It is important to understand that a Data Scientist alone can’t achieve the goals of MLOps.
- Model and Data Versioning
In a traditional software world you need only versioning code, because all behavior is determined by it. In ML things are a little different. In addition to the familiar versioning code we also need to track model versions, the data used to train it, and some meta-information like training hyperparameters.
ML models are harder to test than DevOps one, because no model gives absolutely accurate results. This means that model validation tests need to be surely statistical in nature, rather than having a binary pass/fail status.
It’s also not enough to track one metric for the entirety of the validation set.
A good data pipeline usually starts by validating the input data. In addition to basic validations that any data pipeline performs, ML pipelines need higher level validation statistical properties of the input. For example, if the average diversion of a feature changes considerably from one training dataset to another, it is likely to affect the trained model and its predictions.
For ML systems, monitoring becomes even more important than monitoring production systems. It is because their performance depends not just on factors that we have some control over, like infrastructure and our own software, but also on data, over which we have less control. Therefore, in addition to monitoring standard metrics like latency, traffic, errors and saturation, we also need to monitor model prediction performance.
The Future of MLOps Infrastructure
MLOps is a growing area that lacks competencies and will gain momentum in the future. In the meantime, it is advisable that the best practices and DevOps practices should be employed. The main goal of MLOps is to use ML models more efficiently to solve business problems. But this raises many questions:
- How to shorten the time for launching models into production?
- How to reduce bureaucratic friction between teams of different competencies and increase the focus on collaboration?
- How can you track models, manage versions, and organize effective monitoring?
- How to create a truly cyclical life cycle for a modern ML model?
- How to standardize the machine learning process?
The answers to these questions will largely determine how quickly MLOps will reach its full potential.
Geniusee Experience in How to Build a MLOps Solutions
Not so long ago, we began to build a competency structure and MLOps processes. However, our projects on model lifecycle management and on the application of models as a service are already in the MVP testing stage.
We also determined the structure of competencies that is optimal for a large enterprise and the organizational structure of interaction between all participants in the process. Agile teams were organized to solve problems for the entire spectrum of business customers, as well as a process of interaction with project teams to create platforms, infrastructure, which is the foundation of the MLOps building under construction.
Based on our experience, we can define the most significant Agile approaches pros of MLOps to build a MLOps solution in relation to the specifics of industrial Machine Learning deployment:
- Shorter lead times for quality results through reliable and efficient machine learning lifecycle management;
- reproducible workflows and models thanks to Continuous Development / Integration / Training (CI / CD / CT) methods and tools;
- easy deployment of high-precision ML models anywhere and anytime;
- system of integrated management and continuous monitoring of machine learning resources;
- elimination of organizational barriers and pooling the experience of multidisciplinary ML-specialists.
Thus, speaking about pros of MLOps, another great plus is that the following aspects of ML operations can be optimized using MLOps:
- unify the release cycle of machine learning models and software products created on their basis;/li>
- automate testing of Machine Learning artifacts, such as data validation, testing of the ML model itself and its integration into a production solution;
- implement agile principles in machine learning projects;
- support machine learning models and datasets for them in CI / CD / CT systems;
- reduce technical debt on ML-models.
As ML matures from research to applied business solutions, we need to improve the understanding of its operational processes.
The following table summarizes MLOps’ main practices and how they relate to DevOps and Data Engineering practices: