
Uber Engineering’s Schemaless storage system powers some of the biggest services at Uber, such as Mezzanine. Schemaless is a scalable and highly available datastore on top of MySQL¹ clusters. Managing these clusters was fairly easy when we had 16 clusters. These days, we have more than 1,000 clusters containing more than 4,000 database servers, and that requires a different class of tooling.
Initially, all our clusters were managed by Puppet, a lot of ad hoc scripts, and manual operations that couldn’t scale at Uber’s pace. When we began looking for a better way to manage our increasing number of MySQL clusters, we had a few basic requirements:
- Run multiple database processes on each host
- Automate everything
- Have a single entry point to manage and monitor all clusters across all data centers
The solution we came up with is a design called Schemadock. We run MySQL in Docker containers, which are managed by goal states that define cluster topologies in configuration files. Cluster topologies specify how MySQL clusters should look; for example, that there should be a Cluster A with 3 databases, and which one should be the master. Agents then apply those topologies on the individual databases. A centralized service maintains and monitors the goal state for each instance and reacts to any deviations.
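To make the goal state idea concrete, here is a minimal sketch in Go of what a cluster topology definition might look like. The structure and field names are illustrative assumptions, not Schemadock's actual format; a real goal state for a single instance is shown later in this article.

```go
package topology

// Topology is an illustrative sketch of a cluster topology definition:
// which databases make up a cluster and which one should be the master.
// Field names are assumptions, not Schemadock's actual schema.
type Topology struct {
	Cluster string   `json:"cluster"` // e.g. "cluster-a"
	Master  string   `json:"master"`  // the database that should be master
	Minions []string `json:"minions"` // databases replicating from the master
}

// Matching the example in the text: a Cluster A with 3 databases,
// one of which is designated master.
var clusterA = Topology{
	Cluster: "cluster-a",
	Master:  "db1",
	Minions: []string{"db2", "db3"},
}
```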
Schemadock has many components, and Docker is a small but significant one. Switching to a more scalable solution has been a momentous effort, and this article explains how Docker helped us get here.
Why Docker in the first place?
Running containerized processes makes it easier to run multiple MySQL processes on the same host in different versions and configurations. It also allows us to colocate small clusters on the same hosts so that we can run the same number of clusters on fewer hosts. Finally, we can remove any dependency on Puppet and have all hosts be provisioned into the same role.
As for Docker itself, engineers build all of our stateless services in Docker now. That means that we have a lot of tooling and knowledge around Docker. Docker is by no means perfect, but it’s currently better than the alternatives.
Why not use Docker?
Alternatives to Docker include full virtualization, LXC containers, and simply managing MySQL processes directly on hosts with a tool like Puppet. For us, choosing Docker was fairly simple, since it fits into our existing infrastructure. However, if you're not already running Docker, adopting it just for MySQL is going to be a fairly big project: you need to handle image building and distribution, monitoring, Docker upgrades, log collection, networking, and much more.
All of this means that you should only use Docker if you're willing to invest significant resources in it. Furthermore, Docker should be treated as a piece of technology, not a solution to every problem. At Uber, we designed our system carefully, with Docker as one component in a much bigger system for managing MySQL databases. However, not all companies operate at Uber's scale, and for them a more straightforward setup with something like Puppet or Ansible might be more appropriate.
The Schemaless MySQL Docker Image
At its core, our Docker image just downloads and installs Percona Server and starts mysqld, much like the existing MySQL Docker images out there. However, between downloading and starting, a number of other things happen:
- If there is no existing data in the mounted volume, then we know we’re in a bootstrap scenario. For a master, run mysql_install_db and create some default users and tables. For a minion, initiate a data sync from backup or another node in the cluster.
- Once the container has data, mysqld will be started.
- If any data copy fails, the container will shut down again.
The role of the container is configured using environment variables. What’s interesting here is that the role only controls how the initial data is retrieved—the Docker image itself doesn’t contain any logic to set up replication topologies, status checking, etc. Since that logic changes much more frequently than MySQL itself, it makes a lot of sense to separate it.
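As a rough sketch of the flow described above, the entrypoint logic might look something like the following Go program: the role comes from an environment variable, an empty data volume triggers a bootstrap or a data sync, and any failure makes the container exit. The variable name, the restore command, and the paths are assumptions for illustration, not the actual image's implementation.

```go
package main

import (
	"log"
	"os"
	"os/exec"
)

// run executes a command and streams its output; any failure is fatal,
// which makes the container exit (and later be replaced, not restarted).
func run(name string, args ...string) {
	cmd := exec.Command(name, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatalf("%s failed: %v", name, err)
	}
}

func main() {
	dataDir := "/var/lib/mysql"     // mounted from the host
	role := os.Getenv("MYSQL_ROLE") // e.g. "master" or "minion" (illustrative name)

	// If the mounted volume is empty, we are in a bootstrap scenario.
	if entries, err := os.ReadDir(dataDir); err != nil || len(entries) == 0 {
		switch role {
		case "master":
			// Initialize a fresh data directory; default users and tables
			// would be created afterwards.
			run("mysql_install_db", "--datadir="+dataDir)
		case "minion":
			// Sync data from a backup or another node in the cluster.
			// The command name is an assumption, e.g. an xtrabackup wrapper.
			run("restore-from-backup", dataDir)
		default:
			log.Fatalf("unknown role %q", role)
		}
	}

	// Once the container has data, start mysqld in the foreground.
	run("mysqld", "--datadir="+dataDir)
}
```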
The MySQL data directory is mounted from the host file system, which means that Docker introduces no write overhead. We do, however, bake the MySQL configuration into the image, which effectively makes it immutable. While you can change the config, it will never take effect, because we never reuse Docker containers. If a container shuts down for whatever reason, we don't just start it again. We delete it, create a new container from the latest image with the same parameters (or new ones if the goal state has changed), and start that one instead.
Doing it this way gives us a number of advantages:
- Configuration drift is much easier to control. It boils down to a Docker image version, which we actively monitor.
- Upgrading MySQL is a simple matter. We build a new image and then shut containers down in an orderly fashion.
- If anything breaks we just start all over. Instead of trying to patch things up, we just drop what we have and let the new container take over.
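As a concrete illustration of this replace-instead-of-restart approach, here is a hedged sketch that drives the Docker CLI from Go. The container name, image tag, volume path, and environment variable are made up for the example; the real agent logic is more involved.

```go
package main

import (
	"log"
	"os/exec"
)

// replaceContainer removes an existing MySQL container (if any) and starts a
// fresh one from the latest image with the desired parameters. Old containers
// are never restarted; a new one always takes over.
func replaceContainer(name, image, dataDir, role string) error {
	// Remove the old container entirely; ignore the error if it does not exist.
	exec.Command("docker", "rm", "-f", name).Run()

	// Create and start a new container from the (possibly newer) image.
	// The data directory lives on the host, so the new container picks it up.
	args := []string{
		"run", "-d",
		"--name", name,
		"-v", dataDir + ":/var/lib/mysql",
		"-e", "MYSQL_ROLE=" + role, // illustrative variable name
		image,
	}
	return exec.Command("docker", args...).Run()
}

func main() {
	// Example: replace one instance with a container built from a newer image.
	err := replaceContainer(
		"schemadock01-cluster8-db4",     // illustrative container name
		"uber/schemaless-mysql:5.7-v42", // illustrative image tag
		"/mnt/mysql/cluster8-db4",
		"minion",
	)
	if err != nil {
		log.Fatal(err)
	}
}
```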
Building the image happens through the same Uber infrastructure that powers stateless services. The same infrastructure replicates images across data centers to make them available in local registries.
There is a disadvantage to running multiple containers on the same host: since there is no proper I/O isolation between containers, one container might use all the available I/O bandwidth, starving the remaining containers. Docker 1.10 introduced I/O quotas, but we haven't experimented with those yet. For now, we cope by not oversubscribing hosts and by continuously monitoring the performance of each database.
Scheduling Docker Containers and Configuring Topologies
Now that we have a Docker image that can be started and configured as either master or minion, something needs to actually start these containers and configure them into the right replication topologies. To do this, an agent runs on each database host. The agents receive goal state information for all the databases that should be running on the individual hosts. A typical goal state looks like this:
"schemadock01-mezzanine-mezzanine-us1-cluster8-db4": {
  "app_id": "mezzanine-mezzanine-us1-cluster8-db4",
  "state": "started",
  "data": {
    "semi_sync_repl_enabled": false,