At Uber, over 120,000 production workflows are orchestrated, scheduled, and executed every day. These workflows are owned by over 3,000 users across many teams within Uber, and they power critical ETL jobs, business metrics, dashboards, machine learning models, and regulatory reports. Internally, the Data Workflow Platform (DWP) team makes this happen by developing Uber’s centralized workflow management system, with high infrastructure reliability and minimal scheduling latency. The workflow management system also comes with a user-friendly application that lets users author and manage both streaming and batch workflows in a self-serve way.
However, problems arise as the number of workflows running on the platform grows. One problem is the increasing demand for compute resources from the rapidly growing number of workflows. With the straightforward workflow management UI, authoring and managing a workflow has become easy: it takes only minutes and a few clicks on the website. As a result, the number of workflows and their executions has continued to increase over the past few years. For instance, we have a Presto Task that executed approximately 240,000 Presto queries weekly at the beginning of H1 2021, and that number was growing at a steady pace of 4% per week. At this rate, we estimated that the number of active Presto queries would double every 5 months, requiring additional clusters.
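The growth figures above can be sanity-checked with a short compound-growth calculation. The following is a minimal sketch, assuming the 4% weekly growth compounds week over week; the function names are hypothetical and are not part of any Uber tooling.

```python
import math


def doubling_time_weeks(weekly_growth_rate: float) -> float:
    """Weeks needed for a quantity to double under compounding weekly growth."""
    return math.log(2) / math.log(1 + weekly_growth_rate)


def projected_weekly_queries(initial: float, weekly_growth_rate: float, weeks: float) -> float:
    """Weekly query volume after `weeks` of compounding growth."""
    return initial * (1 + weekly_growth_rate) ** weeks


# Figures taken from the text: ~240,000 queries per week, growing 4% per week.
INITIAL_WEEKLY_QUERIES = 240_000
WEEKLY_GROWTH_RATE = 0.04

weeks_to_double = doubling_time_weeks(WEEKLY_GROWTH_RATE)
volume_at_doubling = projected_weekly_queries(
    INITIAL_WEEKLY_QUERIES, WEEKLY_GROWTH_RATE, weeks_to_double
)

print(f"Doubling time: {weeks_to_double:.1f} weeks")
print(f"Weekly queries at that point: {volume_at_doubling:,.0f}")
```

Under strict weekly compounding, the doubling time comes out to between 17 and 18 weeks, or roughly four months. That is the same order as the five-month estimate quoted above; the exact figure depends on how the growth rate is measured and averaged.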