Engineering, Data / ML

Introducing WorkflowGuard: The Workflow Governance and Observability System That Oversees over 120,000 Data Workflows

13 October 2022 / Global
Figure 1
Figure 2
Suspend
– Workflow retention period reached
– Owner left Uber
– Task kept failing
– Heavy query/resource quota exceeded

Delete
– Workflow has been in inactive/paused status for more than 180 days

Tier Downgrade
– Workflow alert setting does not meet tier-based criteria (for example, page alert is required for Tier 1 workflows)
– Workflow ownership unknown/unclear
– 90 days’ success rate drops below tier-based threshold
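The criteria listed above can be read as a small rule set. The following is a minimal sketch of that reading, not WorkflowGuard's actual implementation: the `WorkflowStatus` fields, the `applicable_actions` function, and the per-tier `tier_success_threshold` value are all hypothetical names introduced for illustration. It returns every action whose criteria match rather than assuming any precedence between actions, since the source does not state one.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Action(Enum):
    SUSPEND = "suspend"
    DELETE = "delete"
    TIER_DOWNGRADE = "tier_downgrade"


@dataclass
class WorkflowStatus:
    """Hypothetical snapshot of one workflow; all field names are illustrative."""
    retention_reached: bool = False
    owner_left_company: bool = False
    task_keeps_failing: bool = False
    quota_exceeded: bool = False
    days_inactive_or_paused: int = 0
    alerts_meet_tier_criteria: bool = True
    ownership_clear: bool = True
    success_rate_90d: float = 1.0
    tier_success_threshold: float = 0.0  # assumed to be supplied per tier


def applicable_actions(wf: WorkflowStatus) -> List[Tuple[Action, str]]:
    """Return every (action, reason) pair whose criterion matches the workflow."""
    found: List[Tuple[Action, str]] = []

    # Suspend criteria
    if wf.retention_reached:
        found.append((Action.SUSPEND, "workflow retention period reached"))
    if wf.owner_left_company:
        found.append((Action.SUSPEND, "owner left the company"))
    if wf.task_keeps_failing:
        found.append((Action.SUSPEND, "task kept failing"))
    if wf.quota_exceeded:
        found.append((Action.SUSPEND, "heavy query/resource quota exceeded"))

    # Delete criterion: inactive or paused for more than 180 days
    if wf.days_inactive_or_paused > 180:
        found.append((Action.DELETE, "inactive/paused for more than 180 days"))

    # Tier downgrade criteria
    if not wf.alerts_meet_tier_criteria:
        found.append((Action.TIER_DOWNGRADE, "alert setting below tier criteria"))
    if not wf.ownership_clear:
        found.append((Action.TIER_DOWNGRADE, "ownership unknown/unclear"))
    if wf.success_rate_90d < wf.tier_success_threshold:
        found.append((Action.TIER_DOWNGRADE, "90-day success rate below threshold"))

    return found
```

For example, `applicable_actions(WorkflowStatus(days_inactive_or_paused=181))` yields a single `Action.DELETE` entry, while a workflow with default values matches no criteria and yields an empty list.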
| Resource Type | Before Governance Enforcement | After Governance Enforcement |
| --- | --- | --- |
| Workflow retention period | Unlimited | Varies based on workflow tier, can be as long as one year |
| Number of tasks per workflow | Unlimited | 2K |
| Max JSON size of API-based workflow | Unlimited | 2MB |
| Minimum schedule interval | No restriction | 15 min for scheduled workflows, no restriction for event-trigger workflows |
| Task retries | Unlimited | |
| Task timeout | As long as 48 hours | 8 hours |
| Workflow/Job execution history | Unlimited | Job history is kept for up to 2 years or up to the last 1,000 executions |
| Resource pool access | No restriction | Restricted access based on LDAP and task type |
| Consecutively failing workflows | No restriction | Auto-pause after consecutively failing for the last 10 scheduled runs |
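A few of the post-enforcement limits in the table above are numeric and can be checked mechanically. The sketch below shows one way such checks might look. It is illustrative only: the function names, parameters, and constants are hypothetical, "2K" is assumed to mean 2,000 tasks, and "2MB" is assumed to mean 2 × 1024 × 1024 bytes. The post-enforcement retry limit is not given in the source, so it is not modeled.

```python
from datetime import timedelta
from typing import List, Optional, Sequence

# Limits taken from the table; the exact unit interpretations are assumptions.
MAX_TASKS_PER_WORKFLOW = 2000            # "2K"
MAX_WORKFLOW_JSON_BYTES = 2 * 1024 * 1024  # "2MB"
MIN_SCHEDULE_INTERVAL = timedelta(minutes=15)
MAX_TASK_TIMEOUT = timedelta(hours=8)
AUTO_PAUSE_AFTER_FAILURES = 10


def check_limits(
    num_tasks: int,
    json_size_bytes: int,
    schedule_interval: Optional[timedelta],  # None means event-triggered
    task_timeout: timedelta,
) -> List[str]:
    """Return a human-readable message for each limit the workflow violates."""
    violations: List[str] = []
    if num_tasks > MAX_TASKS_PER_WORKFLOW:
        violations.append(f"too many tasks: {num_tasks}")
    if json_size_bytes > MAX_WORKFLOW_JSON_BYTES:
        violations.append(f"workflow JSON too large: {json_size_bytes} bytes")
    # The minimum interval applies only to scheduled workflows.
    if schedule_interval is not None and schedule_interval < MIN_SCHEDULE_INTERVAL:
        violations.append(f"schedule interval too short: {schedule_interval}")
    if task_timeout > MAX_TASK_TIMEOUT:
        violations.append(f"task timeout too long: {task_timeout}")
    return violations


def should_auto_pause(scheduled_run_successes: Sequence[bool]) -> bool:
    """True if the last 10 scheduled runs (ordered oldest to newest) all failed."""
    last_runs = scheduled_run_successes[-AUTO_PAUSE_AFTER_FAILURES:]
    return len(last_runs) == AUTO_PAUSE_AFTER_FAILURES and not any(last_runs)
```

A workflow sitting exactly at each limit passes `check_limits` with no violations, and an event-triggered workflow (`schedule_interval=None`) is never flagged for its interval. `should_auto_pause` requires a full window of 10 runs, so a new workflow with only nine failed runs is not paused under this sketch.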
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9
Figure 10

Chengchun Yan

Chengchun Yan is a Senior Software Engineer on Uber’s Data Workflow Platform Team. She works on building workflow governance, improving workflow reliability, resource cost-efficiency and user observability.

Jing Shi

Jing Shi is a Senior Engineering Manager for the Data Workflow Platform team at Uber. Her team builds a scalable, reliable, multi-region platform to orchestrate, schedule, and execute production-grade repetitive data jobs. Users can leverage their unified workflow UI and ecosystem to author, manage, and govern streaming and batch workflows with self-serve and automation capabilities.

Sudhir Mallem

Sudhir Mallem is a Sr. Staff Engineer on the Data Infrastructure team at Uber. He has worked on building and scaling the global data warehouse. Over the last year he has been focusing on several projects on cost-efficiency and reliability of data workflows and supporting the Data Workflow Platform team to scale their services.

Posted by Chengchun Yan, Jing Shi, Sudhir Mallem