Announcing Cadence 1.0: The Powerful Workflow Platform Built for Scale and Reliability
June 22, 2023 / Global
Introduction
We’re thrilled to announce the long-awaited release of Cadence v1.0.0, a major milestone for our team and project after more than six years of development. Cadence is a platform that lets engineers build and manage stateful services (a.k.a. workflows) at scale. Used by over a thousand services at Uber and by many other companies in the community (e.g., DoorDash, HashiCorp, Coinbase), Cadence is a proven workflow engine that has been battle-tested in some of the world’s most demanding distributed systems. It scales to handle complex scenarios and lets developers build workflows in the native programming language of their choice.
What is Cadence?
Cadence is an open-source, code-driven workflow orchestration platform that supports many companies and their critical use cases at scale.
A workflow is a set of tasks to be completed in a certain order. Workflow orchestration is managing the execution overhead of these tasks given workflow definitions. Orchestration engines handle common failures on behalf of the user. They help scale workflows efficiently and reliably. They provide tools to manage traffic and capacity. They make operating on workflows a lot easier.
Traditionally, workflows have been written with DSLs or configs that define the order of tasks and the dependencies between them. While this approach made workflow orchestration simpler, it either limited how much a user could do with a workflow or made the DSLs and configs so complicated over time that they were no longer practical. Simplicity should be on the workflow-writing side rather than the orchestration side, simply because the orchestration engine is built once while a unique workflow needs to be written for each use case. Growing use cases and complexity made it necessary to write workflows as freely as programs in native programming languages. Considering the different types of workflows that exist today, it’s fair to say any program is a workflow. Programming, rather than configs and DSLs, is also the natural way of thinking for software engineers. Thus, Cadence workflows were born.
A Cadence workflow is a program written in a native programming language using Cadence APIs, with minimal limitations and changes compared to writing ordinary software. Cadence workflows bridge the gap between writing a program and scaling it. They are durable, reliable, scalable, transactional, and fault tolerant without the developer having to think about these properties. Out of the box, Cadence provides APIs for external interactions, metrics for observability, and a user interface for inspection.
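To make the idea of a durable, code-defined workflow more concrete, here is a minimal sketch in plain Python. It is not the Cadence client API: `WorkflowContext`, `execute_activity`, and the order example are hypothetical names invented for illustration. The sketch assumes durability comes from persisting each activity result and replaying those results when the workflow code is rerun after a failure.

```python
from typing import Any, Callable, List


class WorkflowContext:
    """Toy context (hypothetical): records activity results for replay."""

    def __init__(self, history: List[Any]) -> None:
        self.history = history  # results persisted by earlier runs
        self._cursor = 0        # index of the next activity call

    def execute_activity(self, fn: Callable[..., Any], *args: Any) -> Any:
        if self._cursor < len(self.history):
            # Replay: reuse the recorded result instead of re-running.
            result = self.history[self._cursor]
        else:
            # First execution: run the activity and persist its result.
            result = fn(*args)
            self.history.append(result)
        self._cursor += 1
        return result


calls = {"charge": 0, "ship": 0}
crash_once = {"armed": True}


def charge(amount: int) -> str:
    calls["charge"] += 1
    return f"charged {amount}"


def ship(order_id: str) -> str:
    calls["ship"] += 1
    if crash_once["armed"]:
        crash_once["armed"] = False
        raise RuntimeError("worker crashed")  # simulated one-time failure
    return f"shipped {order_id}"


def order_workflow(ctx: WorkflowContext, order_id: str, amount: int) -> str:
    # Plain code: branches and loops work as in any other program.
    receipt = ctx.execute_activity(charge, amount)
    if amount > 100:
        ctx.execute_activity(lambda: "fraud check passed")
    tracking = ctx.execute_activity(ship, order_id)
    return f"{receipt}; {tracking}"


def run_until_complete(order_id: str, amount: int) -> str:
    history: List[Any] = []  # stands in for durable storage
    while True:
        try:
            return order_workflow(WorkflowContext(history), order_id, amount)
        except RuntimeError:
            continue  # a fresh run replays the history and carries on


result = run_until_complete("o-1", 150)
print(result)  # charged 150; shipped o-1
print(calls)   # {'charge': 1, 'ship': 2}
```

The card is charged exactly once even though the workflow function runs twice: the second run reads the recorded result from the history instead of calling `charge` again.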
Cadence has two main goals for developers:
- Write any service that just works at any scale
- Remove common overheads through supporting them by default
To understand how big those goals are, let’s investigate what a typical service looks like:
Only the coding and testing are unique to a given service. Since the rest is common, it should be automated. This is exactly what happens when using Cadence.
Cadence:
- Increases productivity: our internal survey from 2021 showed teams write 40% less code to implement the same functionality.
- Improves reliability: transactional guarantees, high availability at scale
- Reduces cost and complexity: shared resources within a domain, simplified workflows
- Simplifies the operations: built-in UI, metrics, logs, etc.
- Protects against human errors: versioning, replaying, and shadowing
V1 Release
Cadence is a 6+ year project. It has been released under v0.x.x until now. Before calling it v1, we had certain milestones in mind regarding its feature set, scale, and robustness. We hit those milestones with this release, and the v1.x.x releases will remain backward compatible with previous versions.
Feature Set
Below is the list of some major features Cadence provides:
- Default APIs for workflows such as start, signal, schedule (distributed cron), terminate, cancel, etc.
- Metrics: health, volume, latency, error, heartbeat, shadow, and many more with necessary dimensions.
- Debuggability: Inputs/outputs to/from workflows and activities, API calls, execution timeline and hierarchical relationship are visible for all the workflows in Cadence UI.
- Custom visibility: Searchable attributes for workflows to filter among billions of instances and to track workflows past certain checkpoints.
- Customer operational endpoints: users can operate on their domains and workflows using our CLI, client APIs, or Web endpoints if needed.
- Operator observability: Cadence provides tools to prevent, detect, and manage noisy neighbors. It comes with built-in rate limits, hot shard detection, version and per-domain scale tracking.
- Scale: load-balancing, scaling support either for the whole cluster or per domain, workflow, activity, and tasklist.
- Failure modes: configurable behaviors on failures: auto retries, region failovers, and workflow resets to rewind and continue from a healthy point in case of a bad deployment.
- Versioning (backward compatibility) to manage new behavior for already in-flight workflows.
- Testing: supported via unit testing framework, replaying/shadowing against production workflows for inconsistencies, and staging before production.
- Hierarchical workflows: users can define parent-child dependencies to build complex service relationships simplified with Cadence workflows. The relationship is also visualized in Cadence Web.
- gRPC and TLS support.
- Authentication and authorization support.
- Cross-domain operations.
- Portability features to move domains from one environment to another.
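As an illustration of the "auto retries" failure mode listed above, the following is a minimal sketch of retrying with exponential backoff. It is a generic technique written in plain Python, not Cadence's implementation; the `RetryPolicy` class, its field names, and `run_with_retries` are assumptions made for this example.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, TypeVar

T = TypeVar("T")


@dataclass
class RetryPolicy:
    """Hypothetical policy: field names are illustrative only."""
    initial_interval: float = 1.0     # seconds before the first retry
    backoff_coefficient: float = 2.0  # multiplier applied after each failure
    maximum_interval: float = 10.0    # cap on any single delay
    maximum_attempts: int = 5         # total tries, including the first

    def delays(self) -> List[float]:
        """Delays between consecutive attempts, capped at maximum_interval."""
        out: List[float] = []
        delay = self.initial_interval
        for _ in range(self.maximum_attempts - 1):
            out.append(min(delay, self.maximum_interval))
            delay *= self.backoff_coefficient
        return out


def run_with_retries(fn: Callable[[], T], policy: RetryPolicy,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    delays = policy.delays()
    for attempt in range(policy.maximum_attempts):
        try:
            return fn()
        except Exception:
            if attempt == policy.maximum_attempts - 1:
                raise  # retries exhausted: surface the last error
            sleep(delays[attempt])
    raise ValueError("maximum_attempts must be at least 1")


attempts = {"n": 0}


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


waited: List[float] = []
# Inject a fake sleep so the example finishes instantly.
outcome = run_with_retries(flaky, RetryPolicy(), sleep=waited.append)
print(outcome, waited)         # ok [1.0, 2.0]
print(RetryPolicy().delays())  # [1.0, 2.0, 4.0, 8.0]
```

With an orchestration engine, this kind of policy is declared once as configuration instead of being hand-written around every call.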
Scale
Today, Cadence is used at many major companies, with over 12 billion executions and 270 billion actions a month at Uber alone. It powers over 1000 services at Uber, from T0 (most critical) to T5. It’s used for long-running workflows, synchronous interactions, microservice orchestration, batch processing, distributed cron, distributed singleton, data pipelines, model training, and many more applications. It has been sustaining 100% year-over-year growth for several years.
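To put these monthly figures in perspective, the short calculation below converts them into average per-second rates. It assumes a 30-day month and perfectly even traffic, neither of which is stated above, so the results are rough averages rather than measured peaks.

```python
EXECUTIONS_PER_MONTH = 12e9   # "over 12 billion executions" a month
ACTIONS_PER_MONTH = 270e9     # "270 billion actions" a month
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: 30-day month

executions_per_second = EXECUTIONS_PER_MONTH / SECONDS_PER_MONTH
actions_per_second = ACTIONS_PER_MONTH / SECONDS_PER_MONTH
actions_per_execution = ACTIONS_PER_MONTH / EXECUTIONS_PER_MONTH

# 100% year-over-year growth means volume doubles every year.
growth_after_three_years = 2 ** 3

print(f"{executions_per_second:,.0f} executions/s")        # 4,630 executions/s
print(f"{actions_per_second:,.0f} actions/s")              # 104,167 actions/s
print(f"{actions_per_execution} actions per execution")    # 22.5 actions per execution
print(f"{growth_after_three_years}x volume after 3 years")  # 8x volume after 3 years
```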
Robustness
Cadence operates reliably despite its growing scale. It guarantees 99.9% availability. In recent years, operational costs needed to stay flat while the scale ramped up, so we made 2022 a reliability year to achieve that goal and to grow healthily in the future. We invested in user capacity management to set expectations with our users, better traffic isolation to avoid noisy neighbor issues, and faster releases for a fresher experience. These investments helped us dramatically reduce the operational load.
With all the improvements mentioned above, we believe Cadence is now a mature enough product for a V1 release.
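As a worked example of what a 99.9% availability target allows, the snippet below converts it into a downtime budget. The measurement windows (a 365-day year and a 30-day period) are assumptions for illustration; the text above does not say how availability is measured.

```python
AVAILABILITY = 0.999  # the 99.9% figure quoted above

HOURS_PER_YEAR = 365 * 24          # assumption: 365-day year
MINUTES_PER_30_DAYS = 30 * 24 * 60  # assumption: 30-day window

unavailable_fraction = 1 - AVAILABILITY
hours_per_year = unavailable_fraction * HOURS_PER_YEAR
minutes_per_30_days = unavailable_fraction * MINUTES_PER_30_DAYS

print(f"{hours_per_year:.2f} hours per year")            # 8.76 hours per year
print(f"{minutes_per_30_days:.1f} minutes per 30 days")  # 43.2 minutes per 30 days
```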
V2 Branch
Apart from maturity, we would also like to offer a much more modern experience with a new V2 branch. So far, all changes in Cadence releases have been backward compatible. While we are proud to maintain that with further v1.x.x releases, there are some fundamental and API changes we would like to make. These cannot be made in a backward-compatible way, so we will provide a way to upgrade from V1.
Roadmap
Greater roadmap transparency for the Cadence project was a popular request in our user survey this year, so we decided to share more about what we are working on.
From a high level, the core Cadence team consists of 20 engineers today and it is still growing. However, the Cadence development community is much bigger than that. Internally, we have teams contributing either directly to Cadence or its underlying technologies such as Apache Cassandra®, Elasticsearch®, Kafka®, and MySQL®. Externally, many companies we work with have dedicated Cadence teams that support their companies and make Cadence contributions.
As mentioned above, 2022 was our reliability year to make our product more robust and lower the operational cost. 2023 is about observability, cost efficiency, and intuitive user experience.
We also changed how we ship major features so that they reach both internal users and open source faster. Because we are committed to keeping all of our changes backward compatible, it is a challenge to fix issues we discover in new APIs after release. Therefore, we build most major features internally first, run them internally for several months, and then port them to open source.
Usability is a major theme. As Cadence scales both internally and externally, we focus more on making it intuitive and easy to operate. We have heard that Cadence has a steep learning curve, so we plan to have our users write their first workflows within minutes rather than days or weeks. There will be simpler samples and tools to generate workflow templates. Nondeterministic changes will be caught at development time for much better efficiency. We are also revamping our documentation and making Cadence Web much richer and more operational.
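One way to picture catching nondeterministic changes at development time is a replay check: run the new workflow code against a history recorded by the old code and fail if the requested steps diverge. The sketch below is a simplified illustration in plain Python, not Cadence's tooling; `Recorder`, `Replayer`, `is_replay_safe`, and the step names are hypothetical.

```python
from typing import Callable, List


class NonDeterminismError(Exception):
    pass


class Recorder:
    """Records the ordered list of steps a workflow requests."""

    def __init__(self) -> None:
        self.steps: List[str] = []

    def step(self, name: str) -> None:
        self.steps.append(name)


class Replayer:
    """Checks each requested step against a recorded history."""

    def __init__(self, history: List[str]) -> None:
        self.history = history
        self.position = 0

    def step(self, name: str) -> None:
        if self.position < len(self.history):
            expected = self.history[self.position]
            if name != expected:
                raise NonDeterminismError(
                    f"step {self.position}: history has {expected!r}, "
                    f"code requested {name!r}")
        # Steps beyond the recorded history are new progress, not a mismatch.
        self.position += 1


def workflow_v1(ctx) -> None:
    ctx.step("reserve_inventory")
    ctx.step("charge_card")


def workflow_v2(ctx) -> None:
    # Reordered steps: incompatible with histories recorded by v1.
    ctx.step("charge_card")
    ctx.step("reserve_inventory")


def is_replay_safe(workflow: Callable, history: List[str]) -> bool:
    replayer = Replayer(history)
    try:
        workflow(replayer)
    except NonDeterminismError:
        return False
    # The code must also account for every recorded step.
    return replayer.position >= len(history)


recorder = Recorder()
workflow_v1(recorder)
history = recorder.steps

print(is_replay_safe(workflow_v1, history))      # True
print(is_replay_safe(workflow_v2, history))      # False
print(is_replay_safe(workflow_v1, history[:1]))  # True (in-flight workflow)
```

Running a check like this in a unit test turns a production-time replay failure into a failed build.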
Observability will be a big part of near-future improvements. We will keep our Grafana® templates up to date. Capacity and backlog tracking per domain/workflow-type/tasklist, anti-pattern detection (e.g., hot shards, mistuned parameters, under-utilization), domain cost attribution, and cluster and worker health checks are some of the features we are currently working on. Anomaly reporting will be integrated with Cadence Web, with links and runbooks on how to investigate and resolve anomalies. Alerting templates will be shared to explain how to monitor Cadence health.
Efficiency became a hot topic in recent years, and Cadence is about to get much more efficient as well. One of our main focuses is to lower the DB load, which seems to be the bottleneck most of the time. Different workflow modes will be introduced to make workflows run faster and cheaper. We plan to re-architect some core functionalities to distribute load evenly by design, which will potentially double or triple the storage capacity. Today, capacity is managed by rate limits; in the near future, different capacity management modes will be introduced to make both the server and the client side much more efficient. With the same resources, you will be able to get a lot more from Cadence.
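To illustrate what rate-limit-based capacity management can look like, here is a minimal per-domain token bucket in plain Python. The token bucket is a common generic technique; the text above does not say which algorithm Cadence uses, and `TokenBucket`, `DomainRateLimiter`, and the domain names are hypothetical. Time is passed in explicitly so the example is deterministic.

```python
from typing import Dict


class TokenBucket:
    """Allows bursts up to `burst` and refills at `rate_per_second`."""

    def __init__(self, rate_per_second: float, burst: float) -> None:
        self.rate = rate_per_second
        self.burst = burst
        self.tokens = burst      # start with a full bucket
        self.last_refill = 0.0   # timestamp of the previous call

    def allow(self, now: float) -> bool:
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class DomainRateLimiter:
    """One bucket per domain, so a busy domain cannot starve the others."""

    def __init__(self, limits: Dict[str, float]) -> None:
        self.buckets = {
            domain: TokenBucket(rate_per_second=rps, burst=rps)
            for domain, rps in limits.items()
        }

    def allow(self, domain: str, now: float) -> bool:
        return self.buckets[domain].allow(now)


limiter = DomainRateLimiter({"payments": 2.0, "reports": 1.0})

burst = [limiter.allow("payments", now=0.0) for _ in range(3)]
neighbor = limiter.allow("reports", now=0.0)
later = limiter.allow("payments", now=0.5)

print(burst)     # [True, True, False]
print(neighbor)  # True
print(later)     # True
```

The third `payments` request is rejected while `reports` is unaffected, which is the isolation property that per-domain limits are meant to provide.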
Community
We increased our community investments this year and are already seeing a great response. Our public Slack workspace, which has been our main communication channel, has seen a 90% increase in active members and a 96% increase in activity. Given the quick turnaround, we will continue to support our community there to make Cadence easier to use and develop.
Cadence is a growing project with contributions from multiple companies. We recognize that a technology like this can only be perfected with better community engagement. Beyond the above, we will continue to run surveys to understand the community’s pain points and respond to them in a timely manner. All public GitHub issues are now tracked by the Cadence team and will be resolved based on their priority. You will soon see major improvements to our documentation and samples as well.
Conclusion
While this V1 release marks Cadence as a robust, mature, scalable, and feature-rich workflow orchestration platform, it is only the beginning of our ambition: setting the standard for building next-generation stateful services at scale and with ease. That will only be possible by exposing Cadence to as many use cases as possible and listening to the community. Automating whatever is painful, repetitive, and common in an intuitive way will make building services much simpler, more reliable, and more efficient.
Elasticsearch is a registered trademark of Elasticsearch BV. Oracle, Java, MySQL, and NetSuite are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. The Grafana Labs Marks are trademarks of Grafana Labs, and are used with Grafana Labs’ permission. We are not affiliated with, endorsed or sponsored by Grafana Labs or its affiliates.
Ender Demirkaya
Ender Demirkaya has been one of the Tech Leads of the Cadence Open Source project built at Uber. He also worked on various parts of search engines between 2008 and 2021 through research or work including Meta and Microsoft / Bing.