Shruti Turner.

Testing 1, 2, 3

MLOps, Data Science, Data Scientist, Machine Learning Engineer, ML Engineering

When we're focused on writing our code and getting it running, it can feel like an additional chore to have to then write tests for it. I mean, we've just run it and it's working so what's the point?

Well, the point is that testing is important, and I'm going to try and give an introductory overview of the what and why, touch a little on the how, i.e. implementation, but leave the details and "how to" for another post!

Generally, Why?

Testing code is a key part of MLOps and code deployment generally. Whether you're directly working in these spaces or you're developing code/models that will then be deployed, testing is a really important thing to understand. At the very least the why and what, even if not the how.

Sticking to examples within Machine Learning and Data Science here, we want to test our models for various things. Writing tests for your code is about more than just checking whether it runs; it's about checking whether each function is actually doing what we expect it to do.

The biggest why, other than for the sake of doing "good science", is that most of us will be using these Machine Learning models to have an impact, whether that's increasing revenue for a business or helping to diagnose patients. We don't want to be using models and wondering why the results aren't as we expect. Or worse, we run the model, it throws no errors, and we base business decisions on poor predictions.

Let's dive into some more of the "whats" we should be testing with some more specific whys...

What Are We Testing?

Broadly, the tests we do can be grouped into two categories: Data and Feature Quality, and the Machine Learning Model itself, including its implementation. Below I'll go into more detail about each; again, remember this is an overview with prompts for what to think about rather than an exhaustive guide.

1. Data and Feature Quality

For any code, it's important that it works, and as I said above, that's more than just making sure it runs on your device with the specific inputs you've put in. That's just one example that works; it doesn't mean your code will work for *any* data that's fed to it. What if your model is deployed and runs when new data is available? You probably won't be manually checking that data every time.

Machine Learning models are picky things: they require the data to be in the right format, and many models can't deal with missing data. These are some of the first things we can test for. Not only this, but poor data quality is a key factor in poorly performing models. So what can we do about this? Well, we can test for it. For instance, for many time series methods we would want the data to be stationary, so let's check that before we run the model with that data.
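To make that concrete, here's a minimal sketch of what such checks might look like as pytest tests: one for missing values and one for stationarity via an Augmented Dickey-Fuller test. The load_training_data() helper, the file path and the "sales" column are made-up stand-ins for your own pipeline.

```python
# Sketch of data-quality tests with pytest. The data loader, file path and
# column name are hypothetical placeholders, not part of any real project.
import pandas as pd
from statsmodels.tsa.stattools import adfuller


def load_training_data() -> pd.DataFrame:
    # Placeholder: replace with however your project actually loads its data.
    return pd.read_csv("data/training.csv", parse_dates=["date"])


def test_no_missing_values():
    df = load_training_data()
    assert not df.isna().any().any(), "Training data contains missing values"


def test_target_series_is_stationary():
    df = load_training_data()
    # Augmented Dickey-Fuller test: a small p-value suggests the series is stationary.
    p_value = adfuller(df["sales"].dropna())[1]
    assert p_value < 0.05, f"Series looks non-stationary (p={p_value:.3f})"
```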

We may also be interested in checking the properties of the input data each time the model is run, to make sure there aren't significant changes (i.e. dataset drift) that can affect the performance of the model.
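One lightweight way to sketch such a drift check is a two-sample Kolmogorov-Smirnov test comparing a new batch against a reference sample. The 0.05 threshold and the column name in the usage comment are illustrative assumptions, not recommendations.

```python
# Hedged sketch of a per-feature drift check using scipy's two-sample
# Kolmogorov-Smirnov test; the significance level is an arbitrary example.
import pandas as pd
from scipy.stats import ks_2samp


def feature_has_drifted(reference: pd.Series, new_batch: pd.Series, alpha: float = 0.05) -> bool:
    """Return True if the new batch's distribution differs significantly from the reference."""
    result = ks_2samp(reference.dropna(), new_batch.dropna())
    return result.pvalue < alpha


# Example usage (column name is made up):
# assert not feature_has_drifted(reference_df["age"], new_batch_df["age"])
```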

2. Machine Learning Model

There are two aspects to the model that you might be testing: 1) performance and 2) software. Both of these aspects are important, but one or the other may be more relevant to your focus. For instance, as someone creating the models, both matter: the model needs to perform well, so we need to check for this while developing it, but we also need to make sure the model is robust in terms of its function. For an ML Engineer working in a company where the Data Scientists develop the models and the role is more MLOps focused, functionality might be the focus of your testing, since model performance is covered as part of model development.

1. Performance

Testing the performance of a model is important for similar reasons to checking code quality: you don't want a model that runs code-wise but isn't accurate. We don't want to be making decisions based on poor predictions coming out of models, many of which can have a devastating real-world impact on people and businesses. I won't go into the details of model metrics here, but it's important to know how you're evaluating your model and the threshold of performance you are after.
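As an illustration, a performance test can be as simple as a gate that fails if a chosen metric drops below an agreed threshold. This sketch assumes a scikit-learn style model and that trained_model, X_test and y_test come from fixtures elsewhere in your test suite; the 0.80 accuracy threshold is purely illustrative.

```python
# Sketch of a performance gate in pytest. The fixtures and the 0.80 threshold
# are assumptions for illustration, not a recommended standard.
from sklearn.metrics import accuracy_score


def test_model_meets_accuracy_threshold(trained_model, X_test, y_test):
    predictions = trained_model.predict(X_test)
    score = accuracy_score(y_test, predictions)
    assert score >= 0.80, f"Accuracy {score:.3f} is below the agreed threshold"
```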

Part of the performance of the model is whether it is reliable, both in terms of the Data Science concepts and the business goals: does the model align with the objectives set? Is the model still relevant as it is, or should it be retrained? When? The point is, there is so much that goes into a model performing well, and these things should be tested for.

Is the model reproducible, i.e. with the same data and model do you get the same results? Are the different functions being implemented correctly? Are unexpected inputs/errors being handled correctly? Is the algorithm being used the correct one?
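Reproducibility, for example, can be checked directly: train the same model twice on the same data with a fixed seed and expect identical predictions. The RandomForestClassifier and the fixture names here are just illustrative choices.

```python
# Sketch of a reproducibility test: identical data + fixed seed should give
# identical predictions. Model choice and fixtures are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def test_training_is_reproducible(X_train, y_train, X_test):
    model_a = RandomForestClassifier(random_state=42).fit(X_train, y_train)
    model_b = RandomForestClassifier(random_state=42).fit(X_train, y_train)
    assert np.array_equal(model_a.predict(X_test), model_b.predict(X_test))
```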

2. Software

Now, I've mentioned that S word...software. In the end, as much as we focus on the statistics and models, in the real world we are writing these models to be deployed for impact. In other words, we are writing software, even if that's not what we call it. As such, we have to test the software side of things, and that's just as important in Machine Learning as in any other software project. We just have to think about it from a data perspective as well as a software one. Do the API calls work? Can they cope with the load? Does the model load correctly in the production environment?
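As a hedged sketch, here's what two of those software-level checks might look like, assuming the model is served via a FastAPI app with a /predict endpoint and saved with joblib; the module path, file path and payload fields are all made up for the example.

```python
# Sketch of software-level tests: does the model artifact load, and does the
# serving endpoint respond? Module, paths and payload are hypothetical.
import joblib
from fastapi.testclient import TestClient

from my_service.api import app  # hypothetical module containing the FastAPI app

client = TestClient(app)


def test_model_artifact_loads():
    model = joblib.load("models/model.joblib")
    assert hasattr(model, "predict")


def test_predict_endpoint_returns_a_prediction():
    response = client.post("/predict", json={"feature_a": 1.2, "feature_b": 0.4})
    assert response.status_code == 200
    assert "prediction" in response.json()
```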

We don't want to have a model that works fantastically in isolation, but doesn't work when it's deployed - what's the point? The place where ML adds value to the business is in production, so let's make sure it's working there.

But, How?

Potentially the most complex question to answer, because even a broad-brush approach has many parts. What we're testing for determines what type of test we want to write. Such a non-answer, I know. I'll try and break it down a little...

There are three main ways to write your test:

1. Unit Tests

These test a specific piece (or unit) of your code. Usually you will have a unit test that checks a specific function, or a single task, performs as expected.
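For example, a unit test for a single, made-up helper function might look like this:

```python
# Minimal unit test: one function, one expected behaviour. normalise_column()
# is a made-up example of a single, self-contained task.
import pandas as pd


def normalise_column(series: pd.Series) -> pd.Series:
    """Scale a numeric column to the 0-1 range."""
    return (series - series.min()) / (series.max() - series.min())


def test_normalise_column_scales_to_unit_range():
    result = normalise_column(pd.Series([10, 20, 30]))
    assert result.min() == 0.0
    assert result.max() == 1.0
```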

2. Integration Tests

These test multiple tasks in one test; for instance, a test of your code's ability to clean your data, which may involve several functions.
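An illustrative integration test might exercise a whole (hypothetical) clean_data() step that chains several of those functions together:

```python
# Sketch of an integration test covering a multi-function cleaning step.
# clean_data() and its expected behaviour are assumptions for illustration.
import pandas as pd

from my_pipeline.cleaning import clean_data  # hypothetical module


def test_clean_data_produces_model_ready_frame():
    raw = pd.DataFrame({"age": [25, None, 25], "city": ["Leeds", "York", "Leeds"]})
    cleaned = clean_data(raw)
    assert cleaned.isna().sum().sum() == 0  # missing values handled
    assert not cleaned.duplicated().any()   # duplicate rows removed
```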

3. End to End Tests

Here we are writing a test to hit all levels of your software (from database to API, in most ML cases), you're hopefully hitting the real endpoint with a specific input and testing whether the response is as expected.
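A sketch of an end-to-end test, assuming a deployed (staging) prediction service; the URL, payload and expected response shape are placeholders for your own setup:

```python
# Sketch of an end-to-end test against a deployed endpoint. The URL, payload
# and response shape are placeholder assumptions.
import requests

STAGING_URL = "https://staging.example.com/predict"  # placeholder URL


def test_end_to_end_prediction():
    payload = {"feature_a": 1.2, "feature_b": 0.4}
    response = requests.post(STAGING_URL, json=payload, timeout=10)
    assert response.status_code == 200
    assert "prediction" in response.json()
```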

These types of tests can be written for 3 main purposes:

1. System Tests

Here we're looking at whether the design of the system we've created is doing what we expect, i.e. testing the logic behind the behaviour and how it compares to what we expect to be happening.

2. Regression Tests

Despite the name, we're not talking about statistics here; rather, these tests check that a change or addition made to the code won't break the existing functionality.

3. Acceptance Tests

This is where we want to test that the original requirements of the problem have been met by our proposed solution. This is also referred to as User Acceptance Testing (UAT).

That's a lot of information, I hope you've stuck with me to this point! Testing is such a huge topic, but a vital one to get your head around for the robust delivery of ML solutions. This is only a light touch; we haven't gone into when or how these tests can or should be delivered, but hopefully if you've got this far you feel like you've got a good grasp on the what and why of testing in Machine Learning. In my view, that's a bigger hurdle than learning how to implement the code!
