Key Takeaways
- Data storage and access at the edge deliver massive performance gains by reducing location-sensitive latency.
- Storing and managing relational data at the edge carries a unique set of challenges dictated by timeless CAP constraints and highly variable load conditions, requiring careful tradeoffs.
- Cloudflare operates a distributed cross-region database architecture, distributing PostgreSQL across multiple regions for resilience and quick failovers.
- Replication lag poses a significant challenge. Especially for distributed replicated systems, architecting for degraded states is much harder than for failure states.
- Embedded databases at the edge and the colocation of storage and compute are the future of relational data.
This is a summary of a talk we gave at QCon San Francisco in October 2023 where we discussed the high availability setup and considered the tradeoffs that Cloudflare had to make around each part of the system. We will explore some of the performance challenges that Cloudflare faced when bringing the database infrastructure closer than ever to the edge and dive deep into the solutions that have been implemented.
What is Edge?
The edge, in the context of distributed systems, refers to services located geographically close to clients. This includes tasks like DNS resolution, content delivery, and IP firewalling. Traditionally, centralized data centers hosted services like relational databases, which were farther from clients. Proximity to clients is the defining metric: serving requests nearer to the client's location reduces network latency and costs.
The Control Plane
Cloudflare safeguards over 27 million internet properties, managing 46 million HTTP requests per second on average. The network spans 250 global Points of Presence locations, handling over 55 million row operations per second at peak on the busiest PostgreSQL cluster, with over 50 terabytes of data across all clusters.
Cloudflare's control plane orchestrates numerous microservices, powering the dashboard. It manages rules and configurations for crucial network services along the data path. The database clusters store data for each customer such as DNS record changes, firewall rules, DDoS mitigation, API gateway routes, and internal service information like billing entitlements and user authentication. Application teams often leverage PostgreSQL for unique purposes, using stored procedures for executing business logic with transactional consistency. PostgreSQL is also utilized as an outbox queue, capturing domain events such as generated DDoS rules from traffic analysis in a centralized data center.
A separate daemon retrieves these events from the database and channels them through Kafka to edge services, allowing latency-critical services to bypass direct database access and benefiting from the durability provided by PostgreSQL and Kafka.
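As an illustration of this outbox pattern, here is a minimal sketch of a poller that forwards unpublished events from PostgreSQL to Kafka. The table name, topic, and client libraries (psycopg2, kafka-python) are hypothetical choices for the example, not Cloudflare's actual daemon:

```python
# Minimal outbox-poller sketch: claim a batch of unpublished events,
# hand them to Kafka, then mark them as published.
import json
import psycopg2
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

conn = psycopg2.connect("dbname=controlplane user=outbox_daemon")

with conn:
    with conn.cursor() as cur:
        # SKIP LOCKED lets several poller instances work in parallel
        # without contending on the same rows.
        cur.execute(
            """
            SELECT id, payload FROM outbox_events
            WHERE published = false
            ORDER BY id
            LIMIT 100
            FOR UPDATE SKIP LOCKED
            """
        )
        rows = cur.fetchall()
        if rows:
            for event_id, payload in rows:
                producer.send("edge-domain-events", payload)
            # Mark the batch as published only after handing it to Kafka.
            cur.execute(
                "UPDATE outbox_events SET published = true WHERE id = ANY(%s)",
                ([event_id for event_id, _ in rows],),
            )
producer.flush()
```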
Cloudflare runs the software and service stack on bare metal, dealing with rack-mounted servers, high-bandwidth network cards, and operational maintenance. On-premises data storage provides maximum flexibility, allowing the team to fine-tune elements like the solid-state RAID configuration and enhance the system with an open-source cluster management system. This transparency and fine-grained control of the cluster would be impossible with managed cloud databases like Amazon RDS.
Unfortunately, this means there is no instant autoscaling button to boost the capacity of the cluster when load increases, an inherent challenge of running the entire stack on-premises.
Architecture
Let's discuss Cloudflare's database architecture at the edge, which needs to process transactions on the order of millions per second.
In designing the system, the team prioritizes high availability, aiming for a Service-Level Objective (SLO) of five nines, i.e., a maximum of roughly five and a half minutes of downtime per year across the entire stack. The transactional workloads lean towards read-heavy operations. The infrastructure has to handle a high rate of read and write operations with minimal latency, as well as maintain fault tolerance. Cloudflare internally leverages an anycast BGP network, with clients naturally load balanced across the PgBouncer proxy instances.
While BGP Anycast optimizes read queries in proximity to clients, write queries are forwarded to the primary region where the primary database instance resides.
Figure 1: Cloudflare database architecture - image source
At the top, PgBouncer manages database server connection pools for the application clients. HAProxy then load balances queries across multiple database cluster instances, preventing overload. In a typical setup, a primary instance replicates data to multiple replicas for a high rate of read queries, managed by the stolon high-availability cluster manager backed by etcd. In this setup, etcd is used for distributed cluster leadership consensus and configuration consistency.
The active-standby model ensures cross-region data redundancy, with the primary region in Portland handling inbound queries and the standby region in Luxembourg ready for traffic redirection or failovers.
We will now analyze each layer, exploring components and tradeoffs.
Persistence at the Edge
The decision to build a highly distributed, relational edge database led the team to consider fundamental architecture principles, particularly the CAP theorem. Since the application is geographically distributed and relies on open-source software like PostgreSQL, under a network partition you have to prioritize either consistency or availability: you can have only one of the two, and Cloudflare usually favors the latter.
In a single data center, the typical architecture includes a primary database, a synchronous replica, and several asynchronous replicas. This topology is replicated across multiple data centers. The semi-synchronous replication setup ensures that if an asynchronous replica goes down, it does not impact applications significantly. The same tradeoffs apply to cross-region replication, allowing applications in one region to continue working if another region goes down. However, if the synchronous replica fails, the entire application halts to prevent inconsistencies; in return, the synchronous replica is guaranteed to be up to date, making it the natural candidate for promotion during failovers. This semi-synchronous replication topology balances the requirements effectively.
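For reference, a semi-synchronous topology like this is expressed on the primary through synchronous_standby_names. A minimal sketch, assuming hypothetical standby names and applied by hand rather than by the cluster manager:

```python
# Sketch: configure one synchronous standby (the rest stay asynchronous)
# and verify which replica is currently acknowledged as synchronous.
import psycopg2

conn = psycopg2.connect("dbname=postgres user=postgres")
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
with conn.cursor() as cur:
    # Commits wait for the first available standby in this list.
    cur.execute(
        "ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (replica_a, replica_b)'"
    )
    cur.execute("SELECT pg_reload_conf()")
    cur.execute("SELECT application_name, sync_state FROM pg_stat_replication")
    print(cur.fetchall())
```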
PostgreSQL, originating in Berkeley in the '90s, was initially designed with a monolithic architecture. To transform it into a distributed system, we heavily rely on two replication mechanisms: streaming replication and logical replication. Cascading replication, a third option, is built on top of these two.
Before delving into replication's role in creating a distributed cluster, let's briefly touch on the write-ahead log (WAL). In any relational database, including MySQL or Oracle, the durability in ACID is achieved through the WAL. Changes to the database are first recorded sequentially in a disk-based WAL file, while the modified data pages are initially held in volatile memory and only asynchronously flushed to disk. This ensures data durability via recovery in the case of a crash.
This feature proves vital in building replication systems, conveniently enabling us to capture and replicate units of work (each log entry).
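A quick way to see these WAL positions, and to measure how far a replica has replayed, is through PostgreSQL's built-in LSN functions (available since PostgreSQL 10). A minimal sketch with placeholder connection strings:

```python
# Sketch: compare the primary's current WAL write position with a
# replica's replay position and compute the lag in bytes.
import psycopg2

with psycopg2.connect("host=primary.example dbname=postgres") as primary, \
     psycopg2.connect("host=replica.example dbname=postgres") as replica:
    with primary.cursor() as cur:
        cur.execute("SELECT pg_current_wal_lsn()")      # last WAL byte written
        primary_lsn = cur.fetchone()[0]
    with replica.cursor() as cur:
        cur.execute("SELECT pg_last_wal_replay_lsn()")  # last WAL byte replayed
        replica_lsn = cur.fetchone()[0]
        cur.execute("SELECT pg_wal_lsn_diff(%s, %s)", (primary_lsn, replica_lsn))
        print("replica is behind by", cur.fetchone()[0], "bytes")
```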
Streaming replication mode is straightforward: a replica establishes a TCP connection to the primary and efficiently streams its log entries. The notable advantage is its performance, capturing changes at up to one terabyte per second with minimal delay, though we will cover potential sources of delay later in the article. A caveat is that it's an all-or-nothing approach: because it replicates at the physical, block level, every change to every Postgres object is replicated.
Logical replication, a newer mechanism, operates at the SQL level, providing flexibility to replicate data at the table, row, or column level. However, it cannot replicate DDL changes, requiring custom tooling, and it has scalability challenges, especially at the multi-terabyte scale.
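As a sketch of what table-level logical replication looks like in practice, the following sets up a publication on the source and a subscription on the destination. Table, publication, and connection details are hypothetical:

```python
# Sketch: logical replication of a single table between two clusters.
import psycopg2

with psycopg2.connect("host=source.example dbname=controlplane") as src:
    src.autocommit = True
    with src.cursor() as cur:
        # Publish only the tables (or, on newer versions, columns/rows)
        # that actually need to move; DDL changes are not replicated.
        cur.execute("CREATE PUBLICATION dns_pub FOR TABLE dns_records")

with psycopg2.connect("host=dest.example dbname=controlplane") as dst:
    dst.autocommit = True  # CREATE SUBSCRIPTION cannot run in a transaction
    with dst.cursor() as cur:
        cur.execute(
            """
            CREATE SUBSCRIPTION dns_sub
            CONNECTION 'host=source.example dbname=controlplane'
            PUBLICATION dns_pub
            """
        )
```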
Cluster Management
One of the primary reasons for cluster management is addressing database failures. Failures, whether logical, like data corruption, or more severe, such as hardware-based failures due to natural disasters, are inevitable. Manual failovers in these situations are cumbersome. Additionally, around 5% of Cloudflare commodity servers are faulty at any point in time, translating to ongoing failures across a fleet of thousands of servers and over multiple data centers. An automated cluster management system is essential for efficiently handling these diverse and critical failures.
The team opted for stolon, an open-source cluster management system written in Go, running as a thin layer on top of the PostgreSQL cluster, with a PostgreSQL-native interface and support for multiple-site redundancy. It is possible to deploy a single stolon cluster across multiple PostgreSQL clusters, whether within one region or spanning multiple regions. Stolon provides stable failover with minimal false positives. Keeper nodes act as the parent processes that manage the PostgreSQL instances; Sentinel nodes function as orchestrators, monitoring the health of the Postgres components and making decisions such as initiating elections for a new primary; and Proxy nodes handle client connections, ensuring a single write primary to avoid multi-master situations.
Figure 2: stolon’s architecture
Connection Pooling
Database connections, being finite resources, require efficient management due to the inherent overhead.
PostgreSQL connections rely on the TCP protocol, involving a three-way handshake and an SSL handshake for communication security. Each connection is served by a separate OS-level process forked by the main postmaster process, which consumes CPU time and memory. As thousands of concurrent connections are opened, socket descriptors become scarce and the CPU time available for transaction processing diminishes.
A server-side connection pooler manages a fixed number of database connections while exposing a PostgreSQL-compatible interface to the clients. As an intermediary, it reduces the total open connections by recycling them as clients connect through it to the database.
This centralizes control over tenant resources, allowing the team to regulate the number of connections allocated per upstream application service. Cloudflare's enhancements to PgBouncer include features inspired by TCP Vegas, a TCP congestion-avoidance algorithm, to enforce stricter multi-tenant resource isolation by throttling a tenant's total throughput.
Cloudflare chose PgBouncer as the connection pooler due to its compatibility with the PostgreSQL protocol: clients can connect to PgBouncer and submit queries as usual, simplifying the handling of database switches and failovers. PgBouncer, a lightweight single-process server, manages connections asynchronously, allowing it to handle more concurrent connections than PostgreSQL.
Figure 3: How PgBouncer works
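Because PgBouncer speaks the PostgreSQL wire protocol, a client connects to it exactly as it would to the database, changing only the host and port (6432 is PgBouncer's default). A minimal sketch with placeholder hostnames and credentials:

```python
# Sketch: the application talks to the pooler instead of the database
# server; queries, drivers, and ORMs work unchanged.
import psycopg2

conn = psycopg2.connect(
    host="pgbouncer.internal",  # pooler, not the database server
    port=6432,
    dbname="controlplane",
    user="app_service",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM dns_records WHERE zone_id = %s", (42,))
    print(cur.fetchone()[0])
conn.close()
```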
PgBouncer introduces client-side connections, separate from direct server PostgreSQL connections, and employs an efficient non-blocking model for network I/O, utilizing a single thread and the OS-provided epoll mechanism for event monitoring.
Figure 4: One PgBouncer process
Unlike PostgreSQL's process-per-connection model, PgBouncer's single-threaded event loop minimizes memory usage by avoiding per-connection stacks and improves CPU utilization by servicing requests from all clients in one thread, preventing CPU time from being wasted on idle connections.
To enable multiple single-threaded PgBouncer processes to utilize all CPU cores on a single machine, the solution was found in the operating system: the team enabled the SO_REUSEPORT socket option, allowing multiple processes to listen on the same port simultaneously.
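The sketch below illustrates the two OS mechanisms just described: a single-threaded, non-blocking accept loop driven by epoll, with SO_REUSEPORT so several such processes can share one port. It demonstrates the idea only and is not PgBouncer's actual implementation:

```python
# Sketch: epoll-driven, single-threaded server with SO_REUSEPORT (Linux).
import select
import socket

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
listener.bind(("0.0.0.0", 6432))
listener.listen(128)
listener.setblocking(False)

epoll = select.epoll()
epoll.register(listener.fileno(), select.EPOLLIN)
clients = {}

while True:
    for fd, events in epoll.poll(1):  # wait up to 1 second for events
        if fd == listener.fileno():
            client, _ = listener.accept()
            client.setblocking(False)
            epoll.register(client.fileno(), select.EPOLLIN)
            clients[client.fileno()] = client
        elif events & select.EPOLLIN:
            data = clients[fd].recv(4096)
            if data:
                # A real pooler would parse the protocol here and forward
                # the query over a pooled server connection; we just echo.
                clients[fd].send(data)
            else:
                epoll.unregister(fd)
                clients.pop(fd).close()
```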
Load Balancing
Cloudflare uses the HAProxy load balancer to evenly distribute incoming database queries across multiple PostgreSQL servers, preventing any single server from being overloaded. Similar to PgBouncer, HAProxy provides high availability and fault tolerance by redirecting traffic from failed servers to healthy ones, which reduces downtime caused by degraded database instances. HAProxy efficiently handles TCP connections at Layer 4 with minimal overhead by using the kernel's splice system call: the inbound and outbound TCP streams are attached within the kernel itself, enabling data transfer without copying it into userspace.
Challenges and Solutions
Let's explore challenges in the database infrastructure and share some performance tips for achieving high availability at the edge. Replication lag is especially pronounced under heavy traffic and write-intensive operations like executing ETL jobs. Any bulk write operation, such as data migrations and GDPR compliance deletions, and even automatic storage compaction (autovacuum), also amplifies the replication lag.
The team targets a 60-second replication lag SLO, advising application teams accordingly. To minimize replication lag, SQL writes are batched into smaller chunks, ensuring smoother replication. Read-after-write consistency is maintained by caching, or by reading from the primary or a synchronous replica immediately after writing. An unconventional approach to avoiding lag is ejecting all replicas from the cluster, leaving only the primary. While this has been effective in practice, it requires a deep understanding of query workloads and potential system changes; you can think of it as the unsafe keyword in Rust.
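To illustrate the batching approach, here is a sketch of chunking a bulk delete (for example, a GDPR purge) into small transactions so that each one generates a modest amount of WAL and replicas can keep up. The table, batch size, and pause are illustrative:

```python
# Sketch: delete old rows in small batches, one transaction per batch.
import time
import psycopg2

BATCH = 5_000

conn = psycopg2.connect("dbname=controlplane user=cleanup_job")
while True:
    with conn, conn.cursor() as cur:  # commits at the end of each batch
        cur.execute(
            """
            DELETE FROM audit_logs
            WHERE id IN (
                SELECT id FROM audit_logs
                WHERE created_at < now() - interval '90 days'
                LIMIT %s
            )
            """,
            (BATCH,),
        )
        deleted = cur.rowcount
    if deleted == 0:
        break
    time.sleep(0.5)  # give replication (and autovacuum) room to catch up
conn.close()
```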
In 2020, a major incident severely impacted database performance and availability at Cloudflare. Cascading failures led to a 75% drop in API availability and 80x slower dashboards. The primary database instances in both main and standby regions executed a failover, but the main region’s new primary struggled under the load and failed again. With no synchronous replicas to promote in the cluster, a crucial decision had to be made: promote an asynchronous replica with potential data loss or evacuate regions to the standby cluster, incurring additional downtime.
Data loss was unacceptable, hence the latter option was chosen. Further investigation revealed a partially failed network switch, operating in a degraded state and blocking communication between both cross-region etcd clusters.
The loss of communication caused the Raft protocol to enter a deadlocked state, leaving the clusters read-only. The subsequent failovers and resynchronization exposed PostgreSQL inefficiencies, particularly in resync times. The setup handles full failures, like machine crashes, effectively, but issues arise from undefined behavior when nodes in the Raft cluster start providing conflicting information.
For example, nodes 1, 2, and 3 experienced degraded network connectivity due to the faulty switch. Node 1 mistakenly concluded that node 3 was no longer the leader, causing a series of failed leadership elections and creating a deadlock. This deadlock forced the cluster into read-only mode, disrupting communication between regions and triggering a failover of the primary instances.
Figure 5: Conflicting information in the cluster
During a failover, the synchronous replica is promoted, and the old primary must undo transactions it committed beyond the point where the two transaction histories diverged. After unwinding the divergent history, the old primary must receive and replay new transactions from the newly promoted primary before rejoining the cluster; in this case, that step failed due to the absence of synchronous replicas to absorb the new data, resulting in downtime.
The identified issues included a hardware failure, a rare Byzantine fault in Raft, and prolonged Postgres resynchronization times. The team prioritized optimizing Postgres to address the third issue. Analyzing the logs revealed that most of the time was spent in rsync, copying over 1.5 terabytes of log files during resynchronization.
Figure 6: pg_rewind copying all the logs, including the common segments before the last divergence point
The solution involved optimizing Postgres internals to copy only the necessary files from the last divergence point, reducing data transfer to 5% of the original size. This cut replica rebuild times from over two hours to five minutes, a more than 95% speedup.
Figure 7: Patched pg_rewind copying logs only from the last divergence point
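For context, resynchronizing a demoted primary with pg_rewind looks roughly like the following; the data directory and connection string are placeholders, and in practice the cluster manager drives this step automatically:

```python
# Sketch: rewind the old primary against the new primary so it can
# rejoin the cluster as a replica. The target instance must be stopped
# before pg_rewind runs.
import subprocess

subprocess.run(
    [
        "pg_rewind",
        "--target-pgdata=/var/lib/postgresql/data",            # the old primary
        "--source-server=host=new-primary.example dbname=postgres",
        "--progress",
    ],
    check=True,  # fail loudly if no divergence point can be found
)
```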
The lessons learned from the large-scale incident include:
- anticipating degraded states and not just fully failed states
- investing in open-source contributions to build in-house expertise; the resulting patch was submitted to the PostgreSQL CommitFest.
- the importance of building and fixing software internally.
Access Data from the Edge
Maintaining a single cluster in one region isn't sufficient. To address various failures, including challenging scenarios like Byzantine hardware failure, you need to distribute PostgreSQL clusters across multiple regions. This cluster-of-clusters approach, seen from a Cloudflare Workers’ standpoint, enhances resilience. Cloudflare expanded from a US-based primary region to include Europe and Asia, forming a hub-and-spoke model using PostgreSQL streaming replication.
Ensuring synchronization across regions is crucial, and that means handling issues like replication lag: when faced with high replication lag, diverting traffic back to the primary region helps meet SLAs for application teams. This strategy enhances system robustness and mitigates the impact of potential failures.
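A simplified sketch of such a lag-aware routing check, using hypothetical hostnames and the replica's replay timestamp, might look like this (real routing happens in the proxy layer, not in application code):

```python
# Sketch: fall back to the primary region when a replica's replay lag
# exceeds the 60-second SLO.
import psycopg2

LAG_SLO_SECONDS = 60

def replica_lag_seconds(replica_dsn: str) -> float:
    with psycopg2.connect(replica_dsn) as conn, conn.cursor() as cur:
        # Note: on an idle cluster this overestimates lag, since no new
        # transactions arrive to be replayed.
        cur.execute(
            "SELECT COALESCE(EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())), 0)"
        )
        return float(cur.fetchone()[0])

def dsn_for_reads() -> str:
    replica_dsn = "host=replica.eu.example dbname=controlplane"
    if replica_lag_seconds(replica_dsn) > LAG_SLO_SECONDS:
        return "host=primary.us.example dbname=controlplane"  # divert to primary
    return replica_dsn
```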
A distributed cluster offers significant benefits, particularly in implementing smart failovers. By strategically choosing the primary region based on factors like current load, available capacity, or time zones ("follow the sun"), we optimize for lower latency during active hours in different parts of the world. This dynamic approach addresses PostgreSQL's single-primary limitation.
Capacity considerations also guide the choices: certain regions may have more capacity, and logistical challenges, like hardware shipments during COVID, impact decisions. Traffic patterns and compliance requirements further influence region selection, allowing the team to address specific regulatory needs efficiently. Adopting a distributed approach enables Cloudflare to navigate these challenges effectively.
Expanding PostgreSQL across multiple regions and achieving quick failovers, as discussed earlier with tools like pg_rewind, is efficient. However, the complexity arises when considering application dependencies. While database migration is manageable, failing over entire application ecosystems, including distributed components like Kafka message queues, is a challenge. Understanding the full application dependency graph is crucial when switching over regions.
Figure 8: The full application dependency graph
Despite ongoing automation efforts, coordination between teams, across different organizations, becomes critical. The sequence to switch services across regions involves considering database clusters, application services (e.g., authentication and configuration) in Kubernetes clusters, and their reliance on central infrastructure like R2 and SSL.
Database Trends
As practitioners, we have observed key trends in relational databases. First, there's a move towards embedding data at the edge: monolithic PostgreSQL remains in a few core regions, complemented by embedded databases like SQLite for real edge processing. It is vital to keep SQLite at the edge synchronized with the PostgreSQL data at core/central locations. Furthermore, ensuring SQLite is PostgreSQL wire-compatible is crucial for seamless integration.
Another trend involves persistence at the edge, with Cloudflare Workers processing client-side data. Additionally, there's a shift towards the colocation of storage and compute, illustrated by features like Smart Placement.
Figure 9: Cloudflare Workers colocated with the database cluster
This reduces latency by avoiding cross-region network hops to otherwise centralized data storage, and it addresses the realization that many client applications spend considerable time communicating with databases.
Lastly, data localization, particularly in Europe using PostgreSQL, is a challenge. Logical replication is a focus, offering flexibility in replicating on a row or column level, although it's still in an exploratory phase. The expectation is that logical replication will play a significant role in addressing this challenge.