In 2016 we published blog posts (I, II) about Schemaless – Uber Engineering’s Scalable Datastore. We went over the design of Schemaless as well as explained the reasoning behind developing it. In this post today we are going to talk about the evolution of Schemaless into a general-purpose transactional database called Docstore.
Docstore is a general-purpose multi-model database that provides a strict serializability consistency model on a partition level and can scale horizontally to serve high volume workloads. Features such as Transaction, Materialized View, Associations, and Change Data Capture combined with modeling flexibility and rich query support, significantly improve developer productivity, and reduce the time to market for new applications at Uber.
Docstore is currently in production and is serving business-critical use cases.
Schemaless was originally designed as an append-only datastore. The smallest entity is called a cell and it is immutable. Removing mutability reduced the complexity of the system and made it less error prone. However, over time we realized that the restrictive API and modeling capabilities made it hard for users to use as a general-purpose database.
The shortcomings of Schemaless resulted in the introduction of Cassandra, which did offer a lot of flexibility and ease of use. However, Cassandra had other shortcomings. Uber has a significant data footprint and at that scale, scalability and efficiency must go hand in hand. We found Cassandra to be lacking operational maturity at Uber’s scale and at the same time it did not provide the desired level of efficiency. The eventual consistency offered by Cassandra also ended up impeding developer productivity as they had to design around the lack of strong consistency and this complicated the application architecture.
Having first-hand experience developing and operating Schemaless and Cassandra, we came to the conclusion that evolving Schemaless into a general-purpose transactional database would be the best choice. Schemaless had traditionally been a highly reliable system but now we needed to focus on usability while achieving similar or better reliability!
We did not want to build a NoSQL system. We wanted to provide the best of both worlds: schema flexibility of document models, and schema enforcement seen in traditional relational models.
We have designed tables in Docstore to enforce a schema on the data. The applications that work with the data usually assume some kind of structure. This means either they utilize schema-on-read in which case the application interprets the data when it is read or they utilize schema-on-write, which ensures that schema is explicit and the database ensures schema conformity of data. We support the latter approach of “schema-on-write” by default.
Together with the schema enforcement described above, Docstore also provides schema flexibility. Schemas can evolve. Records with different schemas are allowed to co-exist and schema updates do not require full table rebuilds. Sparseness and support for complex nested data types are first-class features.
Docstore has the following features built-in out of the box. It integrates with the Uber software ecosystem and can be provisioned with the single-click of a button.
Docstore has a layered architecture and a Docstore deployment is called an instance. Each instance is divided into a query engine layer, a storage engine layer, and a control plane.
The query layer is stateless and is responsible for routing the requests to the storage layer.
The storage engine, responsible for storing the data, is organized as a set of partitions, and data is distributed across these partitions. The control plane is responsible for assigning shards to Docstore partitions and adaptively adjusts the shard placement in response to failure events.
Docstore has a concept of tables. Tables look like relational database tables in that they are structured with rows, columns, and values. There are no restrictions on how tables are modeled in Docstore and with the use of user defined types, Docstore can store nested records as rows. This is useful, for example, when the data has a document-like structure where the entire hierarchy is loaded at once. Docstore also provides support for “Associations”, which allows for representation of one-to-many and many-to-many relationships. We call this the Flexible Document Model as it allows modeling both relational and hierarchical data models. We will cover data modeling in Docstore in the second part of this blog series.
Each table can have one or more materialized views. A materialized view is a view that allows the data to be partitioned differently as compared to the main table, by using different columns. Adding a materialized view partitioned by a non-primary key column makes it efficient to look up data by that column and allows different query access patterns.
Every table must have a primary key and that primary key can be composed of one or more columns. The primary key identifies the row in the table and enforces a unique constraint. Internally, the primary key and partition key columns are stored as a byte array and the value is derived by doing the order-preserving encoding of the key column values. Docstore stores rows in sorted order by primary key values. This combined with composite partition keys enables sophisticated query patterns, including grabbing all rows with a given partition key or using the remainder of the primary key to narrow the relevant rows for a particular query.
Naturally, for us, the next step in our design process was to deal with implementing the sharding logic. Tables are sharded and distributed across multiple shards – transparently to the application. Each shard, representing a set of rows from a table on the order of a few hundred gigabytes, is assigned in its entirety to a partition. A partition can contain one or more shards.
The important design consideration was to let applications control the data locality through their choices of keys. That’s how we introduced partition keys in addition to primary keys. An application can choose to explicitly define partition keys in the schema otherwise Docstore uses primary keys to shard the data.
Typically, there are multiple partitions in each Docstore instance. To eliminate a single point of failure, a partition is a group of 3-5 nodes and each node is a unit of physical isolation and is deployed to an independent zone. Each partition is replicated to multiple geographic locations to provide resilience to data center failures.
Replicated state machine
To ensure consistency, each partition runs the raft consensus protocol. There is a single leader and multiple followers.
All writes are initiated by the leader. The consensus protocol implemented is used to keep the replicated log consistent across the nodes in the partition. This guarantees that all the nodes in a partition contain the same writes in the same order ensuring serializability. The state machine running on each node only proceeds in committing the write if and only if consensus is reached. This gives a very nice property that if a write to a key is successfully committed, all the subsequent reads via the same key return the same data written by this particular operation or some later operation.
Docstore provides a strict serializability consistency model on a partition level. This provides a nice property to the user where they can imagine that transactions are getting executed sequentially. Transactions are ordered such that a transaction ‘A’ started and committed before transaction ‘B’, always occurs before B. This ensures that a read operation always returns the result from the most recent write. From a CAP theorem terminology, Docstore favors consistency over availability and therefore is a CP system.
Docstore uses MySQL as the underlying database engine. The unit of replication in the Replicated State Machine is a MySQL transaction. All the operations are executed within the context of a MySQL transaction to guarantee ACID semantics. These transactions are then replicated across the nodes using the raft consensus protocol.
We depend on MySQL for concurrency control. It is important to realize that MySQL relies on row locks for concurrency control of write operations (insert, update, deletes). Thus, the concurrent updates to the same rows are effectively serialized by MySQL, and by the time the control flow reaches the client issuing the commit, all the locking has been taken care of.
We use the flowchart in figure 7 to show what happens when two transactions interleave in time. We use the boxes at the different positions on the time axis to express the interleaving, i.e. different boxes correspond to different ‘events’ in time.
A raft based replicated state machine implementation allows for exposing MySQL’s transaction to clients in a highly-available way, i.e. all the replicas coordinate with each other to apply the transactions so that automatic failover among replicas is possible while the ACID property of transactions are still maintained even if failover happens. Note that because we rely on exposing MySQL’s transactions to clients, all of the benefits and the constraints of MySQL transactions are inherited.
In this article, we have covered the genesis and the motivation behind Docstore. We have also dived deep into the architecture and explained how transactions are handled in Docstore. In the next part of this multi-part series, we will focus on data modeling and schema management. We will cover how Docstore can support hierarchical and relational models and what type of applications should choose these data models. In the third and concluding part of this series we will do a detailed drilldown of Materialized Views in Docstore. This includes the motivation, the MV refresh framework and how we plan to leverage MVs even if it is not explicitly mentioned in the query.
If you enjoyed this article and want to work on systems and challenges related to globally distributed systems that power the operational storage of Uber, please explore and apply to open positions in the storage team on Uber’s career portal.
Ovais is a Sr. Manager in the Core Storage team at Uber. He leads the Operational Storage Platform group with a focus on providing a world-class platform that powers all the critical business functions and lines of business at Uber. The platform serves tens of millions of QPS with an availability of 99.99% or more and stores tens of Petabytes of operational data.
Deba works as a Sr. Product Manager for the Uber Infrastructure team. Before Uber, Deba has worked at database startups and at Oracle in various product management roles. Prior to moving into product management Deba was responsible for managing the performance of large data warehouses. Deba has a Master’s in Technology Management from UPenn.
Himank is the Tech Lead of Docstore at Uber. His primary focus area is building distributed databases that scale along with Uber's hyper-growth. Prior to Uber, he worked at Yahoo in the mail backend team to build a metadata store. Himank holds a master's degree in Computer Science from the State University of New York with a specialization in distributed systems.
Posted by Ovais Tariq, Deba Chatterjee, Himank Chaudhary
Cinnamon: Using Century Old Tech to Build a Mean Load Shedder
22 November / Global
Real-Time Analytics for Mobile App Crashes using Apache Pinot
2 November / Global
NilAway: Practical Nil Panic Detection for Go
Cinnamon: Using Century Old Tech to Build a Mean Load Shedder
PID Controller for Cinnamon
Up: Portable Microservices Ready for the Cloud