
Testing Best Practices for Machine Learning Libraries

Developing better libraries with pytest

Peng Yan
Towards Data Science
11 min read · Feb 26, 2021


Photo by Kevin Ku on Unsplash

Disclaimer: You won’t be able to fit everything into the proposed structure; apply common sense and your own judgment during test development and design.

These days many Python libraries are built by ML researchers and practitioners. This is also true at my company, where we maintain several internal libraries and cut a new release every sprint. Testing is involved whenever we want to add a new feature, fix an existing bug, or refactor the codebase. During this highly iterative process, we have found that a good testing framework saves us a lot of time.

As a data scientist who does not come from a software engineering background, I want to share in this post some of the testing best practices I have discovered.

We currently use pytest for testing all our internal libraries. So let’s start with some basic knowledge of pytest!

pytest Basics

Before digging into the testing strategies, you need some basic knowledge of pytest. Note that this section covers only the knowledge required to understand our testing strategies. For more information, please refer to pytest’s official documentation.

1. Folder Structure

For each of our internal libraries, we have a separate tests folder dedicated to testing. To use with pytest, the tests folder will have the following structure:

tests
|-- conftest.py
|-- test_some_name.py
|-- test_some_other_name.py

As we can see, there is one conftest.py and several test_*.py files. conftest.py is where you set up test configurations and store the testcases used by the test functions. In pytest, both the configurations and the testcases are called fixtures. The test_*.py files are where the actual test functions reside. Remember, this naming convention is mandatory; otherwise pytest will not be able to locate the fixtures and test functions.

Next we will look at the content of these two types of files to get a better idea of what fixtures are and what test functions look like.

2. Content of conftest.py

To put it simply, conftest.py is a collection of pytest fixtures that are used by test functions across different test_*.py files. Before writing any fixtures, remember to import pytest:

"""
conftest.py
"""
import pytest

2.1 Configuration Fixture

First, let’s look at an example of a configuration fixture: a Spark session fixture.

"""
conftest.py
"""
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark_session(request):
    """
    fixture for creating a spark session
    Args:
        request: pytest.FixtureRequest object
    """
    spark = (
        SparkSession
        .builder
        .master("local[4]")
        .appName("testing-something")
        .getOrCreate()
    )
    request.addfinalizer(lambda: spark.sparkContext.stop())
    return spark

You should notice three things about this code snippet:

  1. A pytest fixture is really just a function wrapped by the pytest.fixture decorator; here it returns the spark instance that will be used for testing.
  2. It has an optional scope argument, which specifies how long the fixture persists. It defaults to “function”, meaning the fixture (the spark instance in our case) would be created for each test function. Since creating a spark instance is an expensive operation and the instance can be reused by different test functions, we set the scope to “session”, so it persists for the whole testing session.
  3. Our function takes a request argument, which is a pytest built-in fixture. We use it to stop the spark instance after the testing session terminates, which is done by the line before the return statement. If your configuration does not need a teardown step, simply remove request from the function signature. (A usage sketch of this fixture follows this list.)
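For illustration, here is a minimal sketch of a test function requesting the spark_session fixture by name; the test file name and the DataFrame contents are made up for this example:

"""
test_spark_setup.py
"""
# hypothetical test module: any test function can use the spark_session
# fixture simply by listing it as an argument
def test_spark_session_builds_dataframe(spark_session):
    df = spark_session.createDataFrame(
        [("hello", "english"), ("hola", "spanish")],
        schema=["text", "language"],
    )
    assert df.count() == 2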

2.2 Testcase Fixture

Next let’s look at the more widely used testcase fixture. Suppose our testcase is a pandas dataframe.

"""
conftest.py
"""
import pandas as pd

@pytest.fixture
def text_language_df():
    return pd.DataFrame({
        "text": ['hello', 'hola', 'bonjour'],
        "language": ["english", "spanish", "french"]
    })

And that’s it! It’s as simple as returning the testcase you want to use. Here we omit the scope argument, so it defaults to “function”.

Next, we will look at the content of the test_*.py files, and hopefully you will see the magic of pytest.

3. Content of test_*.py

Now suppose we want to test our language detection function, which takes in a text string and returns the language it is most likely written in.

So in our test_language_detection.py, we will have this code snippet:

"""
test_language_detection.py
"""
# import the detect_language function here

def test_detect_language(text_language_df):
    for i in range(len(text_language_df)):
        assert detect_language(text_language_df.text[i]) == text_language_df.language[i]

You should notice two things about this code snippet:

  1. The test function’s name starts with “test”. This is required for a test function to be visible to pytest when it is invoked. One trick that makes use of this property is to prepend an underscore to the name of a test function you want to skip for now (see the sketch after this list).
  2. text_language_df is the fixture you declared in conftest.py. Without any import or extra overhead, you can use it in any of the test functions in any of the test_*.py files. You can treat it as a normal pandas dataframe.
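As a small sketch of that trick (the test names here are hypothetical), renaming hides a test from collection, while pytest’s built-in skip marker achieves the same effect and still shows up in the test report:

"""
test_language_detection.py
"""
import pytest

# prepended underscore: pytest no longer collects this function as a test
def _test_detect_language_legacy():
    ...

# the built-in alternative: the test is collected but reported as skipped
@pytest.mark.skip(reason="waiting for the new language model")
def test_detect_language_new_model():
    ...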

Now you will understand why we say conftest.py is “a collection of pytest fixtures that are used by test functions across different test_*.py files”. These fixtures are defined once and used everywhere.

In fact, pytest also allows you to create fixtures inside each test_*.py file. We find it best to put fixtures that are used by only one test_*.py file there, so that conftest.py is not overwhelmed with fixtures.
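For instance, a module-local fixture might look like the following sketch; the fixture name and the (weak) assertion about detect_language are assumptions made only for illustration:

"""
test_language_detection.py
"""
import pytest

# this fixture lives next to the only tests that need it,
# so it does not clutter conftest.py
@pytest.fixture
def short_greetings():
    return ["hi", "hey", "yo"]

def test_detect_language_returns_a_string(short_greetings):
    for text in short_greetings:
        # assumption for this sketch: detect_language always returns a string
        assert isinstance(detect_language(text), str)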

4. pytest CLI

pytest is invoked from the command line. The most direct way is to call

$ pytest tests/

or

$ pytest tests/test_language_detection.py tests/test_something.py

or

$ pytest tests/test_language_detection.py::test_detect_language tests/test_something.py

For more info on specifying tests/selecting tests, please refer to the official documentation.
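Beyond listing files and test ids, pytest’s standard selection and reporting flags are also handy, for example:

$ pytest tests/ -k "language"   # run only tests whose names match the keyword expression

$ pytest tests/ -x -v           # stop at the first failure, with verbose per-test output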

Common Testing Strategies

In this section, we will discuss common testing strategies we developed in testing our internal libraries. The core idea behind these strategies is to enable faster iteration.

0. Test Classification

All the tests we have run for our internal libraries can be roughly divided into three categories based on different granularities:

  • unit tests focus on particular methods or functionality that doesn’t rely on other untested components;
  • integration tests deal with complex flows and interactions that involve several units. They almost always rely on some mocked functionality for faster iteration;
  • end-to-end tests, in contrast to the previous category, don’t take advantage of mocked functionality. They test the whole feature as it is, with all the dependencies present and set up. (A marker-based sketch for keeping these categories separate follows this list.)
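Our libraries’ actual setup is not shown here, but one common way to keep these categories separate in pytest is to tag tests with custom markers (the marker and test names below are illustrative) and then select them with the -m flag, e.g. pytest tests/ -m integration or pytest tests/ -m "not e2e":

"""
test_classifier_pipeline.py
"""
import pytest

# markers should be registered under "markers =" in pytest.ini
# to avoid warnings; the names here are only illustrative
@pytest.mark.integration
def test_training_pipeline_with_mocked_model():
    ...

@pytest.mark.e2e
def test_training_pipeline_end_to_end():
    ...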

1. Testing Workflow

During our library development, we find two workflows extremely common: the testing workflow for new code and the testing workflow for bug-fixing code.

1.1 Testing Workflow for New Code

Image by author

For adding new code, you implement all three levels of tests where applicable. The code has to pass all three levels of testing; a failure at a higher level sends the code back to be re-tested from the lowest level.

1.2 Testing Workflow for Bug Fixing Code

Image by author

For adding code that fixes a bug, we highly recommend adding tests before making changes to the existing code. You should expect the added tests to fail before the bug is fixed and to pass after it is fixed. This way, the tests serve as regression tests that prevent us from accidentally reintroducing the same bug in the future. Another takeaway: always run and pass all tests before merging the code. If no CI/CD tools are available, you have to enforce this rule manually.

Next, we will look at two testing strategies we use to shorten test waiting time and to update tests faster.

2. Parametrizing Testcases

In the previous section, we showed how to create a testcase fixture in conftest.py and then reuse it in our test functions. That testcase fixture is a pandas dataframe holding a list of testcases. One drawback of using dataframes (in fact, any collection data structure such as a list, tuple, or set) is that when one of the testcases in the dataframe fails, the entire test function is marked as failed, and it is hard to find out which single testcase caused the failure. Even more inconvenient, if the test function is computationally expensive and one testcase fails, you will not get results for all testcases in one run: you have to fix the failed testcase, rerun pytest, and repeat this routine whenever another failure occurs.

Fortunately, pytest provides several ways of parametrizing testcases so that each testcase is treated separately and you can get all their results in one run.

For testcases that are used by one test function

This is done with the @pytest.mark.parametrize decorator, which is applied on the test function directly. Let’s see how it works with our language detection testing example.

"""
test_language_detection.py
"""
@pytest.mark.parametrize(
    "text,expected",
    [
        ('hello', 'english'),
        ('hola', 'spanish'),
        ('bonjour', 'french')
    ]
)
def test_detect_language(text, expected):
    assert detect_language(text) == expected

It is quite self-explanatory. In one run, the test function is called three times, so we get the result for each testcase separately.

For more info on parametrizing test functions, please refer to the official documentation.

For testcases that are used by multiple test functions

In this case, instead of parametrizing on test functions, we parametrize the fixture function.

"""
conftest.py
"""
import pytest
from collections import namedtuple

TestCase = namedtuple("TestCase", ["text", "expected"])

@pytest.fixture(
    params=[
        TestCase("hello", "english"),
        TestCase("hola", "spanish"),
        TestCase("bonjour", "french")
    ]
)
def test_case(request):
    return request.param

Then in the test function, we can use the parametrized fixture as follows:

"""
test_language_detection.py
"""
def test_detect_language(test_case):
    assert detect_language(test_case.text) == test_case.expected

You should notice two things about this code snippet:

  1. The built-in request fixture is responsible for coordinating the parametrization. It is somewhat counterintuitive at first glance, but you can think of it simply as the “syntax” for parametrizing a fixture.
  2. The params argument of pytest.fixture takes a list. Here we define each item in the list as a namedtuple to avoid hard-coded indices or strings. Using a namedtuple lets us refer to the testcase’s input and output as test_case.text and test_case.expected in the test function. If the items instead had the form ["hello", "english"], we would have to refer to them as test_case[0] and test_case[1], which is not good programming practice.

For more info on parametrizing fixtures, please refer to the official documentation.

Another implicit benefit of parametrizing testcases is that it makes updating existing tests with new testcases easy. We find this extremely useful in the testing workflow for bug-fixing code. For example, suppose a user reports that the detect_language function incorrectly labels “nǐ hǎo” as “vietnamese” when it should be “chinese”. Following the testing workflow for bug-fixing code, we add the regression test first. As the testcases are already parametrized, this can be done by simply adding the tuple ("nǐ hǎo", "chinese") to the list if we are using @pytest.mark.parametrize, or adding TestCase("nǐ hǎo", "chinese") if we are parametrizing the fixture function. It would be much harder to achieve the same effect if the testcases were not parametrized.
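Concretely, with @pytest.mark.parametrize the regression test is just one more entry in the list:

"""
test_language_detection.py
"""
@pytest.mark.parametrize(
    "text,expected",
    [
        ('hello', 'english'),
        ('hola', 'spanish'),
        ('bonjour', 'french'),
        ('nǐ hǎo', 'chinese')  # regression testcase for the reported bug
    ]
)
def test_detect_language(text, expected):
    assert detect_language(text) == expected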

3. Mocking Complex Classes

When writing tests for a class from our internal libraries, one common situation we run into is that some abstract methods of the class haven’t been implemented yet. These abstract methods are intended for our end users, namely data scientists, to implement based on their use cases. The absence of the implementation prevents us from instantiating the class and testing it. The workaround we found is to subclass the class and explicitly implement the abstract methods, but leave the method bodies empty.

For example, suppose we want to test a Classifier class, which has three abstract methods: load_dataset, load_model, and compute_metrics. As our testing scope is to make sure a Classifier instance functions correctly for general use cases, we don’t want to introduce any specific dataset or model. We create a new class, MockedClassifier, that subclasses Classifier and explicitly implements these abstract methods.

"""
test_classifier.py
"""
class MockedClassifier(Classifier):

    def load_dataset(self, *args, **kwargs):
        pass

    def load_model(self, *args, **kwargs):
        pass

    def compute_metrics(self, *args, **kwargs):
        pass

Then we can use MockedClassifier instead of Classifier to test its functionality. For example, to test its instantiation:

"""
test_classifier.py
"""
def test_instantiation():
    trainer = MockedClassifier("init args here")

Mocking is also useful when the class you want to test has some computationally expensive operations that are unrelated to the testing scope. You can subclass it and override the expensive operations with much lighter ones. For example,

"""
test_class_with_expensive_operation.py
"""
class MockedClass(ClassWithExpensiveOP):

    def some_expensive_operation(self, *args, **kwargs):
        # code for a lighter operation here
        pass

Advanced pytest

In the last section, we will cover some of the advanced pytest techniques we find useful from past experience.

1. Passing in Arguments from Command Line

Sometimes we may want to pass in some arguments from the command line to control the testing behavior. For example, suppose we want to test loading a file from some file path, and the file path varies from platform to platform. To make our tests portable, we can pass in a platform argument from the command line and set the file path based on it.

To add a command line argument,

"""
conftest.py
"""
import pytest

def pytest_addoption(parser):
    parser.addoption(
        "--platform",
        action="store",
        default="platform_0",
        choices=["platform_0", "platform_1", "platform_2"],
        help="The name of the platform you are on"
    )

@pytest.fixture
def platform(pytestconfig):
    return pytestconfig.getoption("platform")

Now we can call

$ pytest tests/ --platform platform_2

And the platform fixture will store the platform name we type in on the command line. It defaults to “platform_0” if no explicit platform is given.

Next we add in the file path fixture, which is determined by the platform we are on.

"""
conftest.py
"""
@pytest.fixture
def filepath(platform):
    if platform == "platform_0":
        return "the file path on platform_0"
    elif platform == "platform_1":
        return "the file path on platform_1"
    elif platform == "platform_2":
        return "the file path on platform_2"

Finally, we can test loading with this filepath fixture.

"""
test_file_load.py
"""
def test_load_file(filepath):
    with open(filepath, "r") as f:
        f.read()

2. Temporary Directories and Files

Some of our library code writes files to disk when running. For example, our experiment tracking library writes logs, metrics, etc. to disk. pytest provides several temporary directory and file fixtures for exactly this use case, and they already have very comprehensive documentation here.
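As a minimal sketch, pytest’s built-in tmp_path fixture hands each test function a fresh temporary directory as a pathlib.Path, cleaned up by pytest afterwards; the metrics file written below is made up for this example:

"""
test_experiment_tracking.py
"""
def test_metrics_are_written_to_disk(tmp_path):
    # tmp_path is a built-in pytest fixture: a unique temporary directory per test
    log_file = tmp_path / "metrics.log"

    # in a real test this line would call our tracking library instead
    log_file.write_text("accuracy: 0.9\n")

    assert log_file.read_text() == "accuracy: 0.9\n"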

3. mocker.spy

When testing some classes, we want to make sure not only that the result is as expected, but also that the number of function calls made is as expected. To enable this powerful functionality, we need to install pytest-mock, a pytest plugin that provides a mocker fixture. Currently we only make use of its spy utility; clear documentation with a simple example can be found here.
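A small sketch of the idea, assuming pytest-mock is installed and that detect_language lives in a module named language_detection (the detect_all helper is hypothetical):

"""
test_language_detection.py
"""
import language_detection  # assumed module that defines detect_language

def detect_all(texts):
    # hypothetical helper that calls detect_language once per text
    return [language_detection.detect_language(t) for t in texts]

def test_detect_all_calls_detect_language_once_per_text(mocker):
    spy = mocker.spy(language_detection, "detect_language")
    detect_all(["hello", "hola", "bonjour"])
    # the real function still runs; the spy only records the calls
    assert spy.call_count == 3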

4. Fixture Decomposition

One way of working with fixtures is to put the fixtures shared among multiple test_*.py files in conftest.py and keep the fixtures that are specific to a test module inside that module. One potential drawback of this approach is that it can result in an enormously large conftest.py, which is hard to navigate and can lead to merge conflicts even when people are working on different test modules.

So when conftest.py becomes so large that the downsides of inefficient navigation and collaboration outweigh the benefit of centralization, it should be split into multiple fixture files: for example, one file for dataset fixtures, one file for configuration fixtures, and so on.

In fact, pytest provides a way to do this without sacrificing the benefit of sharing fixtures among different testing modules. After splitting the conftest.py into several fixture files, you can include them back into conftest.py as plugins.

To be more specific, suppose we have a dataset_fixtures.py and a config_fixtures.py that look like this:

"""
dataset_fixtures.py
"""
import pytest

@pytest.fixture
def dataset_fixture_0():
    # some code here
    ...

@pytest.fixture
def dataset_fixture_1():
    # some code here
    ...

"""
config_fixtures.py
"""
import pytest

@pytest.fixture
def config_fixture_0():
    # some code here
    ...

@pytest.fixture
def config_fixture_1():
    # some code here
    ...

Then, to include them back into conftest.py, we only need to add one line:

"""
conftest.py
"""
import pytest

pytest_plugins = ["dataset_fixtures", "config_fixtures"]

# code for fixtures in conftest.py here

And that’s it! These testing practices are quite easy to carry out, yet they have proved really handy throughout our development cycle. I hope they will help you develop Python libraries in a faster and more robust way :)
