Unit testing is an important part of modern, collaborative software development. Especially as the number of project contributors grows, rigorous unit test coverage helps monitor and enforce high quality. Having a good system in place to generate test cases is important to identify difficult edge cases in your code.
We use NumPy and PyTorch for building many machine learning (ML) models at Uber AI. Our internal hyper-parameter tuning service makes heavy use of PyTorch and has tensor values as inputs to its functions.
To make unit testing easier for these ML models, we introduce Hypothesis GU Func, a new open source Python package created by Uber. An extension to the Hypothesis package, Hypothesis GU Func allows property-based testing of vectorized NumPy functions. This tool has been useful in finding bugs in tools developed internally at Uber AI Labs, and now, with its open source release, can be leveraged by the broader ML community.
Hypothesis and property-based testing
The repetitive and arbitrary nature of ML models makes standard unit testing difficult. The most common type of unit test is called a golden test, demonstrated below:
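A golden test hard-codes one input together with the output previously recorded for it. The following is a minimal sketch of that pattern; the function normalize and its expected values are hypothetical and chosen only for illustration.

```python
import numpy as np


def normalize(x):
    """Hypothetical function under test: scale a vector so it sums to one."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()


def test_normalize_golden():
    # One hand-picked input and the output recorded for it (the "golden" values).
    x = [1.0, 1.0, 2.0]
    expected = np.array([0.25, 0.25, 0.5])
    assert np.allclose(normalize(x), expected)


test_normalize_golden()  # Run the test
```

A test like this checks a single point in the input space, so it says nothing about inputs such as an empty vector or one that sums to zero.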
Scientists often write golden tests for machine learning code because they feel there are few alternatives: It is hard to specify what the correct output should be.
However, property-based testing works as an alternative. With this method of unit testing, a user generates many test cases that attempt to cover the whole input space, and then checks that the desired properties hold for every case. This technique is often referred to as auto-generated tests or fuzz testing.
Hypothesis is a popular Python package (with ports in other languages) for property-based testing. The library provides a given decorator, which draws the test cases from a strategy. Hypothesis has shown that property-based testing is a highly effective form of unit testing for finding bugs in edge cases. A unit test written in this form looks as follows: to test a function foo, one could write:
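As a minimal sketch of that shape, the example below tests a hypothetical foo against a hypothetical slower reference, foo_slow. Both functions and the choice of lists of integers as the input strategy are assumptions made for illustration.

```python
from hypothesis import given
from hypothesis import strategies as st


def foo(xs):
    """Hypothetical function under test: sort a list of integers."""
    return sorted(xs)


def foo_slow(xs):
    """Hypothetical slow reference implementation: insertion sort."""
    out = []
    for x in xs:
        i = 0
        # Skip past every element that should stay in front of x.
        while i < len(out) and out[i] <= x:
            i += 1
        out.insert(i, x)
    return out


@given(st.lists(st.integers()))
def test_foo(xs):
    z = foo(xs)
    # Property: the fast implementation agrees with the slow reference.
    assert z == foo_slow(xs)


test_foo()  # Run the test; Hypothesis draws many lists, including the empty list
```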
For instance, the property could be that the output z is equal to the output from a slower implementation, foo_slow. Sometimes other properties such as “z is sorted” are applicable; or, one could call an inverse function. The classic example from the Hypothesis quick start is an encoder-decoder test:
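The sketch below follows the spirit of that example using run-length encoding. It is a paraphrase rather than a verbatim copy of the quick start, and the encode and decode implementations here are written for illustration.

```python
from hypothesis import given
from hypothesis import strategies as st


def encode(s):
    """Run-length encode a string into a list of [character, count] pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([ch, 1])  # start a new run
    return runs


def decode(runs):
    """Invert encode by expanding each [character, count] pair."""
    return "".join(ch * count for ch, count in runs)


@given(st.text())
def test_decode_inverts_encode(s):
    # Property: decoding an encoded string returns the original string.
    assert decode(encode(s)) == s


test_decode_inverts_encode()  # Run the test
```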
This style of strategy, however, does not support functions that specify their allowed inputs via a NumPy function signature, which makes such functions more difficult to debug.
Unlike the discrete code shown above, much machine learning software relies on NumPy (or similar tensor libraries like PyTorch or TensorFlow). The Hypothesis package supports test case generation for NumPy. In fact, the original contribution for NumPy support in Hypothesis came from Stripe. In an article on Stripe’s blog, former Stripe engineer Sam Ritchie wrote that “Hypothesis is the only project we’ve found that provides effective tooling for testing code for machine learning, a domain in which testing and correctness are notoriously difficult.”
The hypothesis.extra.numpy functionality supports unit testing strategies like:
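A minimal sketch using the arrays strategy from hypothesis.extra.numpy is shown here. The fixed shape, the bounded float range, and the double-transpose property are illustrative choices, not taken from the original post.

```python
import numpy as np
from hypothesis import given
from hypothesis import strategies as st
from hypothesis.extra.numpy import arrays

# Bounded floats keep NaN and infinity out of this illustration.
easy_floats = st.floats(min_value=-10, max_value=10)


@given(arrays(dtype=np.float64, shape=(3, 4), elements=easy_floats))
def test_double_transpose(x):
    # The strategy always produces arrays of the requested shape.
    assert x.shape == (3, 4)
    # Property: transposing twice returns the original array.
    assert np.array_equal(x.T.T, x)


test_double_transpose()  # Run the test
```

Each argument's shape is chosen on its own here, so nothing ties the shape of one array argument to the shape of another.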
Hypothesis GU functions extension
Hypothesis’ NumPy support was not sufficient for our purposes, because we want to generate inputs of variable size that still obey mutual size constraints. So we decided to write our own extension.
Most NumPy-based functions have mutual dimension compatibility constraints between arguments. For instance, the np.dot function takes an (m,n) array and an (n,p) array. NumPy has developed the notion of a function signature in its generalized universal (GU) function API. In this notation, the np.dot signature is ‘(m,n),(n,p)->(m,p)’.
These constraints let us generate test cases that we could not easily generate before, merely by specifying the function signature.
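To make the signature notation concrete, the sketch below uses np.vectorize, which accepts the same signature syntax and checks the core dimensions of its arguments. Wrapping np.dot this way is only an illustration of how the signature constrains shapes.

```python
import numpy as np

# np.vectorize accepts a gufunc-style signature and enforces its core dimensions.
dot = np.vectorize(np.dot, signature="(m,n),(n,p)->(m,p)")

a = np.ones((2, 3))  # (m,n) with m=2, n=3
b = np.ones((3, 4))  # (n,p) with n=3, p=4
assert dot(a, b).shape == (2, 4)  # (m,p)

try:
    dot(a, np.ones((5, 4)))  # n=3 versus n=5 violates the signature
except ValueError as err:
    print("rejected:", err)
```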
Taking this functionality a step further, Hypothesis GU Func can take a function signature and define a strategy that generates test cases compatible with the signature. From our documentation, we have the test:
Another difficult functionality to test is broadcasting. We would like to test that any vectorized functions handle broadcasting correctly. This test is especially important since modern batched training methods often involve some broadcasting and the conventions can be somewhat complicated. For instance, the Pyro team at Uber has noted that at least 50 percent of the bugs in Pyro are due to broadcasting errors.
NumPy defines a convention for how vectorization should be performed, specified by the np.vectorize function. For example, if a user pads the inputs with extra dimensions, such as (3,2,m,n) instead of (m,n), the vectorized code should be equivalent to looping over the extra dimensions. Hypothesis GU Func strategies can generate inputs with extra dimensions that are broadcast-compatible, which is useful when testing that vectorization has been done correctly. Hypothesis GU Func allows users to test broadcasting conventions with the following code:
Providing max_dims_extra=3 gives up to three extra broadcast-compatible dimensions on each of the arguments.
Finding bugs in released code
By testing our own routines with these strategies, we independently rediscovered open bugs in NumPy itself. For instance, we hit corner cases like NumPy issue #9884, which reported problems with the call np.unravel_index(0, ()). We also ran into the broadcasting problems described in NumPy issue #7014, which showed inconsistent broadcasting in cases like np.isclose(0, np.inf) not returning a scalar.
PyTorch, an increasingly popular open source machine learning library, has been an important component of Uber AI’s codebase. To support unit testing of this code, we added PyTorch functionality to Hypothesis GU Func.
Users can generate Torch variables for testing with the following code:
To use Hypothesis GU Func, check out our documentation and simply install the extension with this pip command:
pip install hypothesis-gufunc
At Uber AI, Hypothesis GU Func is used by researchers to unit test edge cases more quickly and efficiently, leading to faster ML insights and better models.
We are looking for contributors and hope users take this package out for a test drive.
Ryan Turner is a Staff Software Engineer at Uber.
Posted by Ryan Turner