Engineering, Data / ML

Spark Analysers: Catching Anti-Patterns In Spark Apps

June 1, 2023 / Global
Figure 1: Architecture
Drogon is the Spark-as-a-Service offering at Uber.
YARNRed (YARN Reduction): Effort to bring down YARN consumption at Uber
Figure 2: Spark Events
Figure 3: Spark Plan
Figure 4: Scan Recommendations
Figure 5: Excessive Partition Scan Analyser
Figure 6: Spark DAG
Figure 7: Spark Transformations & Actions
Figure 8: Duplicate Spark Plan Analyser
Figure 9: Spark Plan Example
Figure 10: YARNRed Architecture
Figure 11: Sample Jira Ticket
vCores: CPU virtual cores utilized
uCores: Max(vCores, memory utilized / SKU ratio)
The SKU ratio is derived from the memory available per vCore in our YARN fleet. As of now, 1 vCore maps to approximately 4.4 GB of memory in our YARN cluster.
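The uCores definition can be illustrated with a short sketch. It assumes that "memory utilized/SKU ratio" means memory in GB divided by the roughly 4.4 GB-per-vCore figure quoted here; the function name `u_cores` and its parameters are hypothetical and not taken from Uber's code.

```python
# Approximate GB of memory per vCore in the YARN fleet, as quoted in the text.
SKU_RATIO_GB_PER_VCORE = 4.4


def u_cores(v_cores: float, memory_gb: float,
            sku_ratio: float = SKU_RATIO_GB_PER_VCORE) -> float:
    """Return max(vCores, memory utilized / SKU ratio).

    Assumption: memory is expressed in GB and the SKU ratio in GB per vCore,
    so the quotient is a vCore-equivalent of the memory footprint.
    """
    if sku_ratio <= 0:
        raise ValueError("sku_ratio must be positive")
    return max(v_cores, memory_gb / sku_ratio)


if __name__ == "__main__":
    # Memory-heavy app: 2 vCores but 17.6 GB, so the memory term dominates.
    print(u_cores(v_cores=2, memory_gb=17.6))
    # CPU-heavy app: 8 vCores with the same memory, so the vCores term dominates.
    print(u_cores(v_cores=8, memory_gb=17.6))
```

Under this reading, an application is charged for whichever resource it consumes more of relative to the fleet's hardware shape, so a memory-heavy job cannot look cheap simply because it requests few cores.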
Pluggable: Easily develop new analysers and plug them into the existing system
Extensible: Extend existing analysers based on the use cases
Scalable: The Flink application can be scaled independently of the Spark application
Decoupled: The overall architecture is layered and decoupled so that each component can be scaled independently
Vijayant Soni

Vijayant Soni is a Software Engineer on Uber's Delivery Data Solutions team. He has worked on enhancing Uber's ETL frameworks to avoid pipeline duplication across environments and to perform small-file compaction with a single feature flag. He ideated and developed Spark Analysers to uncover the most common issues users face when writing a Spark application. He is currently working on decentralizing a huge Hive database (~10 petabytes) to achieve better scalability and sustain significant data growth at Uber.

Sashidhar Thallam

Sashidhar Thallam is a former Staff Software Engineer on Uber's Delivery Data Solutions team. He worked on automations to optimize resource usage for all Hive workloads. He built Query-Analysers, which detect a number of anti-patterns in Hive queries and suggest improvements to the query owners.

Sakshi Pande

Sakshi Pande is a Software Engineer on the Data Chargebacks and Consumption Reduction team. She has been involved with the Chargeback and Cost Efficiency initiative since November 2021 as one of its early engineers, playing a crucial role in efforts such as HDFSRed, YARNRed, and PrestoRed.

Atul Mantri

Atul Mantri is a Senior Software Engineer on Uber's Data Platform team. He is focused on building systems that enable big data observability across all batch and real-time applications at Uber and on accelerating the platform's cost-efficiency initiatives. Before Uber, Atul worked at Rubrik and NetApp building high-performance distributed systems. He holds a Master's degree from NC State University.

Posted by Vijayant Soni, Sashidhar Thallam, Sakshi Pande, Atul Mantri