DoorDash has significantly reduced its cloud infrastructure costs. After moving to a microservices architecture, the company saw its cross-availability-zone (cross-AZ) data transfer costs rise. To bring them down, DoorDash enabled zone-aware routing in its Envoy-based service mesh, taking advantage of its Cell-Based Architecture.
Zone-aware routing lets DoorDash keep traffic within the same availability zone (AZ) wherever possible, minimizing the more expensive cross-AZ data transfers.
With Envoy's zone-aware routing feature, caller services prefer sending traffic to callee instances in the same AZ, thereby reducing cross-AZ data transfer costs. The first figure below shows pods communicating through a simple round-robin load balancer that spreads requests across AZs, incurring additional charges. In contrast, the second figure shows zone-aware routing, where callers prefer callee pods within their own zone.
Simple round-robin load balancing between pods (Source)
Zone-aware routing between pods (Source)
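The difference between the two figures can be illustrated with a small simulation. This is a deliberately simplified sketch, not Envoy's actual algorithm: Envoy's zone-aware routing also weighs the share of healthy hosts per zone and can spill traffic over to other zones, which is not modeled here. The `Endpoint` type and the `round_robin`, `zone_preferring`, and `cross_zone_fraction` helpers are hypothetical names introduced for illustration only.

```python
from itertools import cycle
from typing import NamedTuple


class Endpoint(NamedTuple):
    address: str
    zone: str


def round_robin(endpoints):
    """Cycle through every endpoint regardless of zone."""
    return cycle(endpoints)


def zone_preferring(endpoints, caller_zone):
    """Cycle through same-zone endpoints only.

    Falls back to all endpoints when the caller's zone has none.
    This is a simplification; Envoy's real behavior is more nuanced.
    """
    local = [e for e in endpoints if e.zone == caller_zone]
    return cycle(local or endpoints)


def cross_zone_fraction(picker, caller_zone, requests):
    """Fraction of requests that land in a zone other than the caller's."""
    crossings = sum(
        1 for _ in range(requests) if next(picker).zone != caller_zone
    )
    return crossings / requests


# Illustrative endpoints, one per AZ.
ENDPOINTS = [
    Endpoint("1.1.1.1", "us-west-2a"),
    Endpoint("2.2.2.2", "us-west-2b"),
    Endpoint("3.3.3.3", "us-west-2c"),
]

before = cross_zone_fraction(round_robin(ENDPOINTS), "us-west-2a", 300)
after = cross_zone_fraction(
    zone_preferring(ENDPOINTS, "us-west-2a"), "us-west-2a", 300
)
```

With one callee endpoint per zone, the round-robin picker sends two out of every three requests to another AZ, while the zone-preferring picker keeps every request local as long as a local endpoint exists.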
To enable zone-aware routing, DoorDash modified its in-house custom service mesh control plane to provide Envoy with the AZ information for each node, as seen in the example below.
resources:
  - "@type": type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment
    cluster_name: payment-service.service.prod.ddsd
    endpoints:
      - locality:
          zone: us-west-2a
        lb_endpoints:
          - endpoint:
              address:
                socket_address:
                  address: 1.1.1.1
                  port_value: 80
      - locality:
          zone: us-west-2b
        lb_endpoints:
          - endpoint:
              address:
                socket_address:
                  address: 2.2.2.2
                  port_value: 80
      - locality:
          zone: us-west-2c
        lb_endpoints:
          - endpoint:
              address:
                socket_address:
                  address: 3.3.3.3
                  port_value: 80
Example of an endpoint discovery response - note the added locality information (Source)
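As a rough illustration of what a control plane has to do to produce such a response, the sketch below groups pod addresses by zone and emits a structure with the same field names as the example above. DoorDash's control plane is in-house and its code is not public, so `build_cluster_load_assignment` and its `(address, port, zone)` input tuples are assumptions made purely for illustration.

```python
from collections import defaultdict

CLA_TYPE = "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment"


def build_cluster_load_assignment(cluster_name, pods):
    """Group (address, port, zone) tuples into one locality entry per zone.

    Hypothetical helper: mirrors the shape of the endpoint discovery
    response shown above, not DoorDash's actual implementation.
    """
    by_zone = defaultdict(list)
    for address, port, zone in pods:
        by_zone[zone].append(
            {
                "endpoint": {
                    "address": {
                        "socket_address": {
                            "address": address,
                            "port_value": port,
                        }
                    }
                }
            }
        )
    return {
        "@type": CLA_TYPE,
        "cluster_name": cluster_name,
        "endpoints": [
            {"locality": {"zone": zone}, "lb_endpoints": lb_endpoints}
            for zone, lb_endpoints in sorted(by_zone.items())
        ],
    }


# Illustrative input: two pods in us-west-2a, one in each other zone.
assignment = build_cluster_load_assignment(
    "payment-service.service.prod.ddsd",
    [
        ("1.1.1.1", 80, "us-west-2a"),
        ("1.1.1.2", 80, "us-west-2a"),
        ("2.2.2.2", 80, "us-west-2b"),
        ("3.3.3.3", 80, "us-west-2c"),
    ],
)
```

The key point is that each endpoint group carries a `locality.zone` value; without it, Envoy has no way to tell which callee instances share the caller's AZ.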
DoorDash's Cell-Based Architecture heavily contributed to the success of this move. A Cell-Based Architecture "comes from the concept of a bulkhead in a ship, where vertical partition walls subdivide the ship's interior into self-contained, watertight compartments." Software architects replicate this pattern in complex systems to allow fault isolation. Fault-isolated boundaries restrict the impact of a failure within a workload to a limited number of components, leaving components outside of the boundary unaffected by the failure.
Slack recently showcased its usage of Cell-Based Architecture to mitigate grey failures.
Within DoorDash's Cell-Based Architecture, each cell consists of multiple Kubernetes clusters, and each microservice is deployed exclusively to one cluster within a given cell. DoorDash's engineers deployed each Kubernetes cluster across multiple AZs to enhance availability and fault tolerance.
Cell-based multi-cluster deployments (Source)
By enabling zone-aware routing within these cells, DoorDash effectively localized traffic, further reducing cross-availability zone data transfers. This approach not only optimized network efficiency but also enhanced the system's overall resilience, as it minimized the impact of failures within any single cell, contributing to the robustness of DoorDash's microservices ecosystem.
The authors, Hochuen Wong and Levon Stepanian, don't disclose the savings percentage itself. Still, they state that "these actions made such a material dent in DoorDash's data transfer costs [...] that it caused our cloud provider to reach out to us asking whether we were experiencing a production-related incident." They conclude that:
Cloud service provider data transfer pricing is more complex than it initially seems. It's worth the time investment to understand pricing models in order to build the correct efficiency solution.
It's challenging to build a comprehensive understanding/view of all cross-AZ traffic. Nonetheless, combining network bytes metrics from different sources can be enough to identify hotspots that, when addressed, can make a material dent in usage and cost.
As the number of hops increases in microservice call graphs, the likelihood of data being transmitted across AZs grows, increasing the complexity of ensuring that all hops support zone-aware routing.
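The last point can be made concrete with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that without zone-aware routing each hop picks a callee uniformly at random across the AZs and that hops are independent; real traffic is rarely that uniform. The function name `prob_any_cross_az_hop` is hypothetical.

```python
def prob_any_cross_az_hop(hops, zones=3):
    """Probability that at least one hop in a call chain crosses AZs.

    Assumes each hop independently lands in any of `zones` AZs with
    equal probability, so a single hop stays local with chance 1/zones.
    """
    if hops < 0 or zones < 1:
        raise ValueError("hops must be >= 0 and zones must be >= 1")
    return 1 - (1 / zones) ** hops


# With three AZs: 1 hop -> 0.667, 2 hops -> 0.889, 3 hops -> 0.963
probabilities = {n: round(prob_any_cross_az_hop(n), 3) for n in (1, 2, 3)}
```

Under these assumptions, a three-hop call chain almost always crosses an AZ boundary at least once, which is why every hop in the graph needs to support zone-aware routing for the savings to hold.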
The authors recommend that owners of microservices-based systems look into their data transfer costs and consider a service mesh not only for its traffic management features, but also for its potential to improve efficiency.