Unleashing the power of Presto: The Uber case study

The magic behind Uber’s data-driven success

Uber, the ride-hailing giant, is a household name worldwide. We all recognize it as the platform that connects riders with drivers for hassle-free transportation. But what most people don’t realize is that behind the scenes, Uber is not just a transportation service; it’s a data and analytics powerhouse. Every day, millions of riders use the Uber app, unwittingly contributing to a complex web of data-driven decisions. This blog takes you on a journey into the world of Uber’s analytics and the critical role that Presto, the open source SQL query engine, plays in driving their success.

Uber’s DNA as an analytics firm

At its core, Uber’s business model is deceptively simple: connect a customer at point A to their destination at point B. With a few taps on a mobile device, riders request a ride; then, Uber’s algorithms work to match them with the nearest available driver and calculate the optimal price. But the simplicity ends there. Every transaction, every cent matters. A ten-cent difference in each transaction translates to a staggering $657 million annually. Uber’s prowess as a transportation, logistics and analytics company hinges on their ability to leverage data effectively.
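The $657 million figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes the figure is derived from the roughly 18 million trips per day cited elsewhere in this article and a 365-day year; the source does not spell out the calculation, so treat this as an inference rather than Uber's own method.

```python
# Assumption: ~18 million trips per day (the figure cited in this article)
# and a 365-day year. Working in whole cents avoids floating-point drift.
TRIPS_PER_DAY = 18_000_000
DAYS_PER_YEAR = 365
CENTS_PER_TRIP = 10  # the ten-cent difference per transaction


def annual_impact_usd(trips_per_day, cents_per_trip, days=DAYS_PER_YEAR):
    """Yearly dollar impact of a fixed per-trip price difference."""
    total_cents = trips_per_day * days * cents_per_trip
    return total_cents / 100


print(f"${annual_impact_usd(TRIPS_PER_DAY, CENTS_PER_TRIP):,.0f}")
```

Under those assumptions, 18,000,000 × 365 × $0.10 comes to $657,000,000, which matches the figure quoted above.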

The pursuit of hyperscale analytics

The scale of Uber’s analytical endeavor requires careful selection of data platforms with high regard for limitless analytical processing. Consider the magnitude of Uber’s footprint.¹ The company operates in more than 10,000 cities with more than 18 million trips per day. To maintain analytical superiority, Uber keeps 256 petabytes of data in store and processes 35 petabytes of data every day. They support 12,000 monthly active users of analytics running more than 500,000 queries every single day.

To power this mammoth analytical undertaking, Uber chose the open source Presto distributed query engine. Teams at Facebook developed Presto to handle high numbers of concurrent queries on petabytes of data and designed it to scale up to exabytes of data. Presto was able to achieve this level of scalability by completely separating analytical compute from data storage. This allowed them to focus on SQL-based query optimization to the nth degree.

What’s Presto?

Presto is an open source distributed SQL query engine for data analytics and the data lakehouse, designed for running interactive analytic queries against datasets of all sizes, from gigabytes to petabytes. It excels in scalability and supports a wide range of analytical use cases. Presto’s cost-based query optimizer, dynamic filtering and extensibility through user-defined functions make it a versatile tool in Uber’s analytics arsenal. To achieve maximum scalability and support a broad range of analytical use cases, Presto separates analytical processing from data storage. When a query is constructed, it passes through a cost-based optimizer, then data is accessed through connectors, cached for performance and analyzed across a series of servers in a cluster. Because of its distributed nature, Presto scales for petabytes and exabytes of data.

The evolution of Presto at Uber

Beginning of a data analytics journey

Uber began their analytical journey with a traditional analytical database platform at the core of their analytics. However, as their business grew, so did the amount of data they needed to process and the number of insight-driven decisions they needed to make. The cost and constraints of traditional analytics soon reached their limit, forcing Uber to look elsewhere for a solution.

Uber understood that digital superiority required the capture of all their transactional data, not just a sampling. They stood up a file-based data lake alongside their analytical database. While this side-by-side strategy enabled data capture, they quickly discovered that the data lake worked well for long-running queries, but it was not fast enough to support the near-real time engagement necessary to maintain a competitive advantage.

To address their performance needs, Uber chose Presto because of its ability, as a distributed platform, to scale in linear fashion and because of its commitment to ANSI SQL, the lingua franca of analytical processing. They set up a few clusters and began processing queries at a much faster speed than anything they had experienced with Apache Hive, a distributed data warehouse system, on their data lake.

Continued high growth

As the use of Presto continued to grow, Uber joined the Presto Foundation, the neutral governing body behind the Presto open source project, as a founding member alongside Facebook. Their initial contributions were based on their need for growth and scalability. Uber focused on contributing to several key areas within Presto:

Automation: To support growing usage, the Uber team went to work on automating cluster management to make it simple to keep up and running. Automation enabled Uber to grow to their current state with more than 256 petabytes of data, 3,000 nodes and 12 clusters. They also put process automation in place to quickly set up and take down clusters.

Workload Management: Because different kinds of queries have different requirements, Uber made sure that traffic is well-isolated. This enables them to batch queries based on speed or accuracy. They have even created subcategories for a more granular approach to workload management.

Because much of the work done on their data lake is exploratory in nature, many users want to execute untested queries on petabytes of data. Large, untested workloads run the risk of hogging all the resources. In some cases, the queries run out of memory and do not complete.

To address this challenge, Uber created and maintains sample versions of datasets. If they know a certain user is doing exploratory work, they simply route them to the sampled datasets. This way, the queries run much faster. There may be inaccuracy because of sampling, but it allows users to discover new viewpoints within the data. If the exploratory work needs to move on to testing and production, they can plan appropriately.
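The trade-off between speed and accuracy can be illustrated with a small sketch. This is not Uber's implementation; the table name `trips`, the sampled copy `trips_sample_1pct`, the 1% sampling rate and the helper functions are all hypothetical, chosen only to show how routing exploratory work to a sampled copy and scaling the result back up might look.

```python
import random


def build_sample(rows, fraction, seed=0):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def route_table(table, exploratory, sampled_tables):
    """Point exploratory work at the sampled copy when one exists."""
    if exploratory and table in sampled_tables:
        return sampled_tables[table]
    return table


def estimate_total(sample_rows, fraction, key):
    """Scale a sum over the sample up to an estimate for the full table."""
    return sum(row[key] for row in sample_rows) / fraction


# Hypothetical data: 100,000 trips with a flat fare of 10.0 each.
trips = [{"fare": 10.0} for _ in range(100_000)]
sample = build_sample(trips, fraction=0.01)

sampled_tables = {"trips": "trips_sample_1pct"}  # hypothetical names
print(route_table("trips", exploratory=True, sampled_tables=sampled_tables))
print(route_table("trips", exploratory=False, sampled_tables=sampled_tables))
print(len(sample), estimate_total(sample, 0.01, "fare"))
```

The exploratory query scans roughly 1% of the rows, and the scaled-up total is only an approximation of the true 1,000,000. That is the inaccuracy the paragraph above refers to, traded for a much cheaper scan.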

Security: Uber adapted Presto to take users’ credentials and pass them down to the storage layer, specifying the precise data to which each user has access permissions. As Uber has done with many of its additions to Presto, they contributed their security upgrades back to the open source Presto project.

The technical value of Presto at Uber

Analyzing complex data types with Presto

As a digital native company, Uber continues to expand its use cases for Presto. For traditional analytics, they are bringing data discipline to their use of Presto. They ingest data in snapshots from operational systems. It lands as raw data in HDFS. Next, they build model data sets out of the snapshots, cleanse and deduplicate the data, and prepare it for analysis as Parquet files.

For more complex data types, Uber uses Presto’s complex SQL features and functions, especially when dealing with nested or repeated data, time-series data or data types like maps, arrays, structs and JSON. Presto also applies dynamic filtering that can significantly improve the performance of queries with selective joins by avoiding reading data that would be filtered by join conditions. For example, a Parquet file can store data as BLOBs within a column. Uber users can run a Presto query that extracts a JSON file and filters out the data specified by the query. The caveat is that doing this defeats the purpose of the columnar layout, because the whole JSON value has to be read and parsed to reach a single field. It is a quick way to do the analysis, but it does sacrifice some performance.
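To make the caveat concrete, here is a minimal sketch of what extracting fields from a JSON BLOB column involves. It is plain Python rather than Presto SQL, and the rows, the `payload` column and the field names are hypothetical. In Presto itself, this kind of extraction would typically use the built-in JSON functions such as `json_extract_scalar`.

```python
import json

# Hypothetical rows: one ordinary column plus a JSON document stored as a string.
rows = [
    {"trip_id": 1, "payload": '{"city": "SF", "fare": {"total": 23.5}}'},
    {"trip_id": 2, "payload": '{"city": "NYC", "fare": {"total": 41.0}}'},
    {"trip_id": 3, "payload": '{"city": "SF", "fare": {"total": 8.25}}'},
]


def extract(payload, path):
    """Parse the whole JSON document, then walk a list of keys into it."""
    value = json.loads(payload)
    for key in path:
        value = value[key]
    return value


def fares_for_city(rows, city):
    """Return (trip_id, total fare) for rows whose JSON 'city' matches."""
    return [
        (row["trip_id"], extract(row["payload"], ["fare", "total"]))
        for row in rows
        if extract(row["payload"], ["city"]) == city
    ]


print(fares_for_city(rows, "SF"))
```

Every row's document is parsed in full just to read one or two fields, so the engine cannot skip unneeded data the way it could if `city` and `fare.total` were stored as separate columns.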

Extending the analytical capabilities and use cases of Presto

To extend the analytical capabilities of Presto, Uber uses many out-of-the-box functions provided with the open source software. Presto provides a long list of functions, operators, and expressions as part of its open source offering, including standard functions, maps, arrays, mathematical, and statistical functions. In addition, Presto also makes it easy for Uber to define their own functions. For example, tied closely to their digital business, Uber has created their own geospatial functions.
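The source does not describe Uber's geospatial functions, so the following is only an illustration of the kind of logic such a function might wrap: a great-circle (haversine) distance between two coordinates. The function name and the mean Earth radius of 6,371 km are assumptions of this sketch, and it is written in Python rather than as an actual Presto function.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


# Distance between a hypothetical pickup and drop-off point.
print(round(haversine_km(37.7749, -122.4194, 37.8044, -122.2712), 2))
```

A function like this, registered with the query engine, would let analysts compute trip distances directly in a query instead of exporting coordinates to another tool.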

Uber chose Presto for the flexibility it provides with compute separated from data storage. As a result, they continue to expand their use cases to include ETL, data science, data exploration, online analytical processing (OLAP), data lake analytics and federated queries.

Pushing the real-time boundaries of Presto

Uber also upgraded Presto to support real-time queries and to run a single query across data in motion and data at rest. To support very low latency use cases, Uber runs Presto as a microservice on their infrastructure platform and moves transaction data from Kafka into Apache Pinot, a real-time distributed OLAP data store, used to deliver scalable, real-time analytics.

According to the Apache Pinot website, “Pinot is a distributed and scalable OLAP (Online Analytical Processing) datastore, which is designed to answer OLAP queries with low latency. It can ingest data from offline batch data sources (such as Hadoop and flat files) as well as online data sources (such as Kafka). Pinot is designed to scale horizontally, so that it can handle large amounts of data. It also provides features like indexing and caching.”

This combination supports a high volume of low-latency queries. For example, Uber has created a dashboard called Restaurant Manager in which restaurant owners can look at orders in real time as they are coming into their restaurants. Uber has made the Presto query engine connect to real-time databases.

To summarize, here are some of the key differentiators of Presto that have helped Uber:

Speed and Scalability: Presto’s ability to handle massive amounts of data and process queries at lightning speed has accelerated Uber’s analytics capabilities. This speed is essential in a fast-paced industry where real-time decision-making is paramount.

Self-Service Analytics: Presto has democratized data access at Uber, allowing data scientists, analysts and business users to run their queries without relying heavily on engineering teams. This self-service analytics approach has improved agility and decision-making across the organization.

Data Exploration and Innovation: The flexibility of Presto has encouraged data exploration and experimentation at Uber. Data professionals can easily test hypotheses and gain insights from large and diverse datasets, leading to continuous innovation and service improvement.

Operational Efficiency: Presto has played a crucial role in optimizing Uber’s operations. From route optimization to driver allocation, the ability to analyze data quickly and accurately has led to cost savings and improved user experiences.

Federated Data Access: Presto’s support for federated queries has simplified data access across Uber’s various data sources, making it easier to harness insights from multiple data stores, whether on-premises or in the cloud.

Real-Time Analytics: Uber’s integration of Presto with real-time data stores like Apache Pinot has enabled the company to provide real-time analytics to users, enhancing their ability to monitor and respond to changing conditions rapidly.

Community Contribution: Uber’s active participation in the Presto open source community has not only benefited their own use cases but has also contributed to the broader development of Presto as a powerful analytical tool for organizations worldwide.

The power of Presto in Uber’s data-driven journey

Today, Uber relies on Presto to power some impressive metrics. From their latest Presto presentation in August 2023, here’s what they shared:

Uber’s success as a data-driven company is no accident. It’s the result of a deliberate strategy to leverage cutting-edge technologies like Presto to unlock the insights hidden in vast volumes of data. Presto has become an integral part of Uber’s data ecosystem, enabling the company to process petabytes of data, support diverse analytical use cases, and make informed decisions at an unprecedented scale.

Getting began with Presto

If you’re new to Presto and want to check it out, we recommend this Getting Started page where you can try it out.

Alternatively, if you’re ready to get started with Presto in production, you can check out IBM watsonx.data, a Presto-based open data lakehouse. Watsonx.data is a fit-for-purpose data store, built on an open lakehouse architecture, supported by querying, governance and open data formats to access and share data.

Request a live demo here to see Presto and watsonx.data in action

Try watsonx.data for free

¹ Uber. EMA Technical Case Study, sponsored by Ahana. Enterprise Management Associates (EMA). 2023.

Chair, Presto Community Group and Community at IBM