Uber uses convolutional neural networks in many domains that could potentially involve coordinate transforms, from designing self-driving vehicles to automating street sign detection for building maps to maximizing the efficiency of spatial movements in the Uber Marketplace.
In deep learning, few ideas have had as much impact as convolution. Almost all state-of-the-art results in machine vision make use of stacks of convolutional layers as basic building blocks. Since such architectures are widespread, we should expect that they excel at simple tasks like painting a single pixel in a tiny image, right?
Surprisingly, it turns out that convolution often has difficulty completing seemingly trivial tasks. In our paper, An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution, we expose and analyze a generic inability of convolutional neural networks (CNNs) to transform spatial representations between two different types: coordinates in (i, j) Cartesian space and coordinates in one-hot pixel space. It’s surprising because the task appears so simple, and it may be important because such coordinate transforms seem to be required to solve many common tasks, like detecting objects in images, training generative models of images, and training reinforcement learning (RL) agents from pixels. It turns out that these tasks may have subtly suffered from this failing of convolution all along, as suggested by performance improvements we demonstrate across several domains when using the solution we propose, a layer called CoordConv.
Our findings are summarized in the video below:
First discovery: supervised rendering is hard for CNNs
Let’s consider a simple task of Supervised Rendering in which we give an (i, j) location as input to a network and ask it to produce a 64×64 image painted with a square centered at that location, as shown in Figure 1a. What type of network would you use to solve this task?
We could use the same approach taken by many research works that generate images and paint the square with a stack of deconvolution (transposed convolution) layers. To test this idea, we created a dataset consisting of randomly placed 9×9 squares on a 64×64 canvas, as shown in Figure 1b. Enumerating all possible fully visible squares results in a dataset with a total of 3,136 examples. To evaluate how well models generalize, we define two train/test splits: a uniform split, where all possible center locations are randomly divided 80 percent/20 percent into train vs. test sets, and a quadrant split, where the canvas is divided into four quadrants: squares centered in the first three quadrants are put in the train set and squares centered in the last quadrant are put in the test set. The distributions of both splits of the dataset are depicted in Figure 1c, below:
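As a rough sketch of how such a dataset and its two splits could be enumerated (this is not the two-line version from our paper): the canvas and square sizes come from the text above, while the random seed and the choice of which quadrant counts as the "last" one are assumptions made purely for illustration.

```python
import numpy as np

CANVAS = 64          # canvas side length, from the text
SQUARE = 9           # square side length, from the text
HALF = SQUARE // 2   # distance from the center to the edge of the square

# Every center whose 9x9 square lies fully inside the canvas.
centers = np.array([(i, j)
                    for i in range(HALF, CANVAS - HALF)
                    for j in range(HALF, CANVAS - HALF)])

def render_square(center):
    """Return a 64x64 image with a 9x9 block of ones centered at `center`."""
    i, j = center
    image = np.zeros((CANVAS, CANVAS), dtype=np.float32)
    image[i - HALF:i + HALF + 1, j - HALF:j + HALF + 1] = 1.0
    return image

images = np.stack([render_square(c) for c in centers])

# Uniform split: shuffle all centers, then divide 80/20.
rng = np.random.default_rng(0)   # seed is an arbitrary choice
order = rng.permutation(len(centers))
n_train = int(0.8 * len(centers))
uniform_train, uniform_test = order[:n_train], order[n_train:]

# Quadrant split: hold out one quadrant entirely.
# Assumption: the held-out quadrant is the one where both coordinates are >= 32.
held_out = (centers[:, 0] >= CANVAS // 2) & (centers[:, 1] >= CANVAS // 2)
quadrant_train = np.flatnonzero(~held_out)
quadrant_test = np.flatnonzero(held_out)

print(len(centers))                               # 3136 examples
print(len(quadrant_train), len(quadrant_test))    # 2352 784
```

The 56 valid center positions per axis (4 through 59) give the 56 × 56 = 3,136 examples quoted above.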
We assumed CNNs would trivially solve this task, both because the task is so simple (the entire dataset may be generated in only two lines of Python, as shown in our paper), and because the dataset is so small that we can easily overparameterize the model. But it turns out that CNNs perform surprisingly poorly. Even models with as many as 1M parameters and trained for over 90 minutes (Figure 2b) were unable to achieve over 0.83 test IOU on the uniform split and over 0.36 test IOU on the quadrant split (Figure 2a).
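For readers unfamiliar with the metric, intersection over union (IOU) between a predicted image and its target can be sketched as follows. The 0.5 binarization threshold and the function name `iou` are assumptions for illustration, not necessarily what our experiments used.

```python
import numpy as np

def iou(prediction, target, threshold=0.5):
    """Intersection over union of two images after thresholding them into masks."""
    pred_mask = np.asarray(prediction) > threshold
    target_mask = np.asarray(target) > threshold
    union = np.logical_or(pred_mask, target_mask).sum()
    if union == 0:
        return 1.0  # both masks are empty, so they agree perfectly
    return np.logical_and(pred_mask, target_mask).sum() / union

# A 9x9 target square and a prediction shifted three pixels to the right.
target = np.zeros((64, 64))
target[10:19, 10:19] = 1.0
prediction = np.zeros((64, 64))
prediction[10:19, 13:22] = 1.0

print(iou(prediction, target))  # 0.5: 54 shared pixels out of 108 in the union
```

A square that is only three pixels off already drops to 0.5, which gives a sense of how far from correct a 0.36 test IOU is.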
Simplified task and second discovery: supervised coordinate classification is hard for CNNs
So why is Supervised Rendering so hard? It is worth digging a little deeper to understand more fully. After all, if rendering trained with direct supervision is this difficult, it will only become more challenging when we switch to unsupervised learning, e.g. training a Generative Adversarial Network (GAN) on the same data with loss provided by a learned discriminator.
Let’s narrow the problem down to isolate what makes the problem challenging. We now ask the network to simply paint one pixel (instead of a 9×9 square). One can imagine that given a solution to this one-pixel task, a further deconvolutional network could easily expand such a pixel into a larger square, an intuition we experimentally validated. We thus arrive at the Supervised Coordinate Classification task (Figure 3a), where the dataset consists of pairs of (i, j) coordinates and images with the single corresponding pixel activated, as shown in Figure 3b, below:
We again tried lots of networks with different hyperparameters, and observed that even though some networks can memorize the training set, none of them exceed 86 percent test accuracy (Figure 4a). And that’s with more than an hour of training.
We expected convolution would work perfectly, but it doesn’t. Why not? To figure out what the network is actually doing, we take the best network trained and examine its predictions.
We asked the network to paint an image where there is only one pixel on (having a value of 1 in the one-hot representation). To see what’s going on, let’s zoom into the little region around the target pixel. In Figure 5, the target pixel is highlighted in red, and we show both the model’s softmax prediction as well as the logits. The first pixel (top row) is in the training set, so as expected the model gets it right, although some probability leaks outside of the target pixel. The next pixel to the right (middle row) is in the test set, and the model is just barely right with a neighboring pixel capturing almost as much probability. Looking at the pixel one more to the right (bottom row) shows that the model is completely wrong. This is surprising because, as a result of the 80/20 split, almost all test pixels are surrounded by training pixels.
Reversed direction and third discovery: supervised regression is also hard for CNNs
So why is highlighting one pixel given the location so hard for the network? Is it because expanding information from a small space to a large one is difficult? Would it be easier in the other direction? What if we train a convolutional network to collapse image information into scalar coordinates, more akin to ordinary image classification?
It turns out that this supervised regression task works just as poorly. In Figure 10, the dots on the left show the correct pixel coordinates, and the dots in the middle show the model’s predictions. Glossing over some details, the model does poorly on the test set and visibly struggles to predict the training set.
In short, the direction doesn’t matter.
This seemingly simple coordinate transform task causes problems for convolution in both directions: from Cartesian (i, j) space to one-hot pixel space and the other way around. Even when trained with supervision, when only one pixel is being painted, and when training examples are all around, convolution still fails to learn smooth functions between Cartesian space and pixel space. Moreover, the best performing convolutional models are huge, barely work at best, and take a long time to train.
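To make concrete how small the target function is, here is a hand-written sketch of the transform in both directions, assuming the same 64×64 canvas as above. The function names are hypothetical.

```python
import numpy as np

CANVAS = 64  # assumed canvas size, matching the rendering task

def coords_to_one_hot(i, j):
    """Cartesian (i, j) -> image with only the pixel at row i, column j set to 1."""
    image = np.zeros((CANVAS, CANVAS), dtype=np.float32)
    image[i, j] = 1.0
    return image

def one_hot_to_coords(image):
    """One-hot image -> Cartesian (i, j) of its single active pixel."""
    i, j = np.unravel_index(np.argmax(image), image.shape)
    return int(i), int(j)

# The two functions invert each other at every pixel location.
round_trip_ok = all(
    one_hot_to_coords(coords_to_one_hot(i, j)) == (i, j)
    for i in range(CANVAS)
    for j in range(CANVAS)
)
print(round_trip_ok)  # True
```

This is the mapping that stacks of convolutional layers struggle to learn from examples.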
The solution: CoordConv
It turns out there’s a simple fix for this challenge.
Convolution is translation equivariant, which means that as each filter is applied across the input to generate the output, it doesn’t know where it is. We can assist convolution by letting filters know where they are. We do this by adding two channels to the input: one with i coordinates and one with j coordinates. We call the resulting layer CoordConv, depicted in Figure 6, below:
The proposed CoordConv layer is a simple extension to the standard convolutional layer wherein convolution is conditioned on coordinates. Allowing convolutional filters to see coordinates breaks translation equivariance, which might seem like a bad idea. Isn’t translation equivariance the signature benefit of convolution?
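The coordinate channels themselves can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the reference code from our paper: the name `add_coords`, the channels-first layout, and the scaling of coordinates to [-1, 1] are assumptions made here.

```python
import numpy as np

def add_coords(x):
    """Append an i (row) channel and a j (column) channel to x of shape (C, H, W)."""
    _, height, width = x.shape
    i_values = np.linspace(-1.0, 1.0, height)   # assumed scaling to [-1, 1]
    j_values = np.linspace(-1.0, 1.0, width)
    i_channel, j_channel = np.meshgrid(i_values, j_values, indexing="ij")
    return np.concatenate([x, i_channel[None], j_channel[None]], axis=0)

# A single-channel image with one active pixel, and the same image shifted right.
x = np.zeros((1, 64, 64))
x[0, 10, 10] = 1.0
original = add_coords(x)
shifted = add_coords(np.roll(x, 5, axis=2))

print(original.shape)                                 # (3, 64, 64)
print(np.array_equal(original[1:], shifted[1:]))      # True: coordinates stay put
print(np.array_equal(original[0], shifted[0]))        # False: content moved
```

Because the coordinate channels stay fixed while the image content moves, any filter that reads them can respond differently to the same pattern at different positions, which is exactly how translation equivariance is broken.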
We argue that convolution has enjoyed success due to three important factors: it employs relatively few learned parameters, it is fast to compute on modern GPUs, and it learns a function that is translation equivariant.
The CoordConv layer keeps the first two of these properties—few parameters and efficient computation—and the degree to which it is equivariant is learned. If weights from the coordinates learn to become zero, CoordConv behaves like standard convolution. On the other hand, if translation dependence is useful for the downstream task, it’s capable of learning that, too. But as we’ll see, ultimately the proof is in the convolution.
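Both behaviors can be checked numerically with a small sketch. The loop-based `conv2d` below is a deliberately simple stand-in for a framework convolution, and `coord_conv`, the tensor shapes, and the [-1, 1] coordinate scaling are assumptions chosen for illustration.

```python
import numpy as np

def conv2d(x, weights):
    """Valid cross-correlation. x: (C_in, H, W); weights: (C_out, C_in, k, k)."""
    c_out, _, k, _ = weights.shape
    h_out, w_out = x.shape[1] - k + 1, x.shape[2] - k + 1
    out = np.zeros((c_out, h_out, w_out))
    for di in range(k):
        for dj in range(k):
            patch = x[:, di:di + h_out, dj:dj + w_out]
            out += np.einsum("oc,chw->ohw", weights[:, :, di, dj], patch)
    return out

def coord_conv(x, weights):
    """Append i and j coordinate channels (scaled to [-1, 1]), then convolve."""
    _, h, w = x.shape
    i_ch, j_ch = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w),
                             indexing="ij")
    return conv2d(np.concatenate([x, i_ch[None], j_ch[None]], axis=0), weights)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 16, 16))             # toy 3-channel input
weights = rng.standard_normal((4, 3 + 2, 3, 3))  # 4 filters over 3 + 2 channels

# Case 1: zero the weights that read the two coordinate channels.
zeroed = weights.copy()
zeroed[:, 3:] = 0.0
matches_plain_conv = np.allclose(coord_conv(x, zeroed), conv2d(x, weights[:, :3]))

# Case 2: shift the input two pixels right and compare the overlapping columns.
shifted = np.roll(x, 2, axis=2)
plain_is_equivariant = np.allclose(conv2d(shifted, weights[:, :3])[:, :, 2:],
                                   conv2d(x, weights[:, :3])[:, :, :-2])
coord_is_equivariant = np.allclose(coord_conv(shifted, weights)[:, :, 2:],
                                   coord_conv(x, weights)[:, :, :-2])

print(matches_plain_conv, plain_is_equivariant, coord_is_equivariant)
```

With zeroed coordinate weights the layer reproduces plain convolution, and with non-zero coordinate weights a shifted input no longer produces a merely shifted output, so the degree of translation dependence is determined by the weights the layer learns.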
CoordConv is related to a range of existing ideas like locally connected layers, Compositional Pattern Producing Networks, and position embeddings used in language modeling. (Check out our paper for more discussion around this concept).
CoordConv solves previous supervised tasks
First, let’s revisit the previous tasks and see how CoordConv fares.
As shown in Figures 7 and 8, CoordConv models attain perfect training and testing performances for both the Supervised Coordinate Classification and Supervised Rendering tasks, on both splits of training and test sets. Further, CoordConv models have 10-100 times fewer parameters and train in seconds rather than the more than an hour needed for the best-performing standard CNNs (150 times faster).
For a closer examination, Figure 9, below, shows a comparison of regular deconvolution vs. CoordConv when painting adjacent pixels:
When using convolution to paint pixels, we observed artifacts and overfitting. With CoordConv, performance is perfect for both our training and test sets. The same story holds in the reverse direction. Whereas convolution had trouble regressing coordinates, CoordConv models the function well, as depicted in Figure 10, below:
CoordConv helps in a wide range of domains
At this point we’ve shown a failing of convolution to solve a toy problem, and we proposed a fix in the form of the CoordConv layer. Naturally we wonder: was this issue endemic only to the toy problem, or have we found a core issue that has persisted insidiously inside other tasks, hampering performance from within? To answer this, we inserted CoordConv layers in networks trained on a variety of tasks. Below is a summary of what we found, with further details in our paper.
Because object detection models look at pixel space and output bounding boxes in Cartesian space, they seem like a natural fit for CoordConv. And we find our intuition is borne out: on a simple problem of detecting MNIST digits scattered on a canvas, we found the IOU of a Faster-RCNN network improved by about 24 percent when using CoordConv.
Of all vision tasks, we might expect image classification to show the least performance change when using CoordConv instead of convolution, as classification is more about what is in the image than where it is. Indeed, when we add a CoordConv layer to the bottom of ResNet-50 and train on ImageNet, we find only a microscopic improvement.
In generative models like GANs and Variational Autoencoders (VAEs), pixels are painted from latents, which in an ideal world might encode high level concepts like position. Intuitively, CoordConv might help here. Using a simple dataset of shapes based on Sort-of-CLEVR, we train GANs and VAEs and show interpolations between latents.
Take a simple task of generating colored shapes. Videos of interpolations, depicted in Figure 11 below for both an ordinary GAN (left) and a CoordConv GAN (right), showcase how CoordConv improves the performance of generative models.
With generative models, we investigate the impact of CoordConv by interpolating between points in the latent space, a common approach for evaluating how well generative models generalize.
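As a minimal sketch of what such an interpolation involves: the straight-line path, the latent size of 16, and the `generator` mentioned in the comment are all hypothetical and stand in for whatever a given model uses.

```python
import numpy as np

def interpolate_latents(z_start, z_end, num_steps):
    """Return num_steps latent vectors evenly spaced on the line from z_start to z_end."""
    alphas = np.linspace(0.0, 1.0, num_steps)[:, None]   # shape (num_steps, 1)
    return (1.0 - alphas) * z_start + alphas * z_end

rng = np.random.default_rng(0)
latent_dim = 16                                # hypothetical latent size
z_a = rng.standard_normal(latent_dim)
z_b = rng.standard_normal(latent_dim)

path = interpolate_latents(z_a, z_b, num_steps=8)
print(path.shape)  # (8, 16)

# Each row of `path` would then be decoded into one animation frame, e.g.
# frames = [generator(z) for z in path]   # `generator` is a hypothetical trained model
```

If the model has learned position as a smooth latent factor, consecutive frames should show objects gliding across the canvas rather than fading out in one place and in at another.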
For the ordinary GAN on the left, the animation looks fine at first. But when we observe the animation closely, we notice that not everything moves; the visual artifacts are tied to the canvas, and actually some bits of objects are simply appearing and disappearing. When we put CoordConv into both the generator and discriminator, the motion is smoother. We see objects remain coherent and move smoothly instead of teleporting.
We notice similar patterns when training VAEs. With convolution, we observe parts of the objects in the image fading in and out, but with CoordConv, objects move around more smoothly.
When using larger GANs to paint Large-scale Scene Understanding (LSUN) bedroom scenes, with convolution we again observe frozen objects fading in and out. With CoordConv, we instead see smooth geometric transformations, including translation and deformation.
RL is an interesting domain where CoordConv might help. We trained some agents to play Atari games, like Ms. Pac-Man. We thought that if a convolutional filter could simultaneously recognize Ms. Pac-Man and extract her location in the maze, it could be useful for learning better policies.
We tried adding CoordConv to our own implementation of Distributed Prioritized Experience Replay (Ape-X), but CoordConv didn’t immediately improve performance. We also tried A2C, a popular policy gradient method, and CoordConv helped more often than not. This might reflect a difference between learning explicit policies and learning Q functions. Out of nine games we tried, in six A2C with CoordConv trained faster or had better final scores than standard convolution. And as expected, we noticed a boost in Ms. Pac-Man scores. In two games, CoordConv performed similarly, and on one it performed slightly worse. Overall, these results suggest CoordConv could be useful in RL.
In this article, we have demonstrated the curious inability of CNNs to model the coordinate transform task, and we’ve introduced a simple fix in the form of the CoordConv layer. Our results suggest that including these layers can boost performance in a wide range of applications. Future work will further evaluate the benefits of CoordConv in large-scale datasets, exploring its impact in detection, language tasks, and video prediction, as well as with spatial transformer networks and cutting-edge generative models.
We’re curious to know if you find our research useful for your own work. If you’d like to give it a try, check out our paper and use the CoordConv layer (see section S8 for example code) as a drop-in replacement for convolution. Let us know what you discover!