To ensure their Spanner database keeps working reliably, Google engineers use chaos testing to inject faults into production-like instances and stress the system's ability to behave correctly in the face of unexpected failures.
As Google engineer James Corbett explains, Spanner builds on a solid foundation of machines, disks, and networking hardware with low failure rates. That alone, however, is not enough to guarantee correct behavior under all circumstances, including bad memory, corrupted data on disk, network failures, software errors, and so on.
According to Corbett, fault-tolerant techniques are key to masking failures and achieving high reliability. These include checksums to detect data corruption, data replication, the Paxos algorithm for consensus, and others. But to confirm that these mechanisms are actually effective, you need to exercise them regularly. This is where chaos testing comes into play: deliberately injecting faults into production-like instances at a much higher rate than they would occur in a production environment.
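To illustrate the kind of mechanism these tests are meant to exercise, the following is a minimal sketch of checksum-based corruption detection, not Spanner's actual implementation, using Go's standard crc32 package: data is checksummed when written and verified when read back, so a silently flipped byte is caught instead of being served to clients.

```go
package main

import (
	"bytes"
	"fmt"
	"hash/crc32"
)

// block pairs a chunk of data with the checksum computed when it was written.
type block struct {
	data     []byte
	checksum uint32
}

func writeBlock(data []byte) block {
	return block{data: data, checksum: crc32.ChecksumIEEE(data)}
}

// verify recomputes the checksum on read and reports silent corruption.
func verify(b block) error {
	if crc32.ChecksumIEEE(b.data) != b.checksum {
		return fmt.Errorf("checksum mismatch: data corrupted")
	}
	return nil
}

func main() {
	b := writeBlock([]byte("hello spanner"))

	// Simulate the kind of silent corruption a chaos test would inject.
	corrupted := block{data: bytes.ToUpper(b.data), checksum: b.checksum}

	fmt.Println("clean block:    ", verify(b))         // <nil>
	fmt.Println("corrupted block:", verify(corrupted)) // checksum mismatch
}
```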
As Corbett puts it: "We run over a thousand system tests per week to validate that Spanner’s design and implementation actually mask faults and provide a highly reliable service. Each test creates a production-like instance of Spanner comprising hundreds of processes running on the same computing platform and using the same dependent systems (e.g., file system, lock service) as production Spanner."
The faults that are injected belong to several categories, including server crashes, file faults, RPC faults, memory/quota faults, and Cloud faults.
Server crashes are induced by sending a SIGABRT signal to a server at an arbitrary time, triggering its recovery logic. This entails aborting all distributed transactions coordinated by the crashed server, forcing all clients accessing that server to fail over to a different one, and replaying a disk-based log of all operations so that no memory-only data is lost.
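As a rough illustration, a crash injector can be as simple as signalling a target process at a random moment. The Go sketch below crashes its own process for demonstration purposes; in a real harness the target PID would come from the test framework's process discovery, which is assumed here.

```go
package main

import (
	"log"
	"math/rand"
	"os"
	"syscall"
	"time"
)

// targetPID is a hypothetical placeholder for the harness's process
// discovery; here it simply returns our own PID so the example is runnable.
func targetPID() int {
	return os.Getpid()
}

func main() {
	// Wait a random amount of time so the crash lands at an unpredictable
	// point in the server's execution.
	time.Sleep(time.Duration(rand.Intn(30)) * time.Second)

	pid := targetPID()
	log.Printf("injecting crash: sending SIGABRT to pid %d", pid)
	if err := syscall.Kill(pid, syscall.SIGABRT); err != nil {
		log.Fatalf("failed to signal pid %d: %v", pid, err)
	}
}
```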
File faults are injected by intercepting all file system calls and randomly altering their outcome: returning an error code, corrupting the content on a read or write, or never returning so that the call times out.
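A minimal sketch of the same idea, applied at the io.Reader level rather than by intercepting actual system calls as the Spanner tests do: the wrapper randomly fails reads outright or silently flips a byte. The error and corruption rates are made-up knobs, not values used by the Spanner team.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"math/rand"
	"strings"
)

// faultyReader wraps an io.Reader and randomly injects hard errors or
// silent corruption into reads.
type faultyReader struct {
	r        io.Reader
	errRate  float64 // probability a read fails outright
	flipRate float64 // probability a successful read has one byte flipped
}

func (f *faultyReader) Read(p []byte) (int, error) {
	if rand.Float64() < f.errRate {
		return 0, errors.New("injected I/O error")
	}
	n, err := f.r.Read(p)
	if n > 0 && rand.Float64() < f.flipRate {
		p[rand.Intn(n)] ^= 0xFF // silent corruption that checksums must catch
	}
	return n, err
}

func main() {
	fr := &faultyReader{
		r:        strings.NewReader("some on-disk content"),
		errRate:  0.3,
		flipRate: 0.3,
	}
	buf, err := io.ReadAll(fr)
	fmt.Printf("read %q, err: %v\n", buf, err)
}
```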
A similar approach is applied to interprocess communication over RPC: calls are intercepted to inject delays, return error codes, and simulate network partitions, remote system crashes, or bandwidth throttling.
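Spanner's internal RPC stack is not described in detail, but the technique can be sketched with a gRPC client interceptor: every outgoing call has some probability of being delayed or of failing with UNAVAILABLE, as if the peer were partitioned away. The rates and the localhost address are placeholders.

```go
package main

import (
	"context"
	"math/rand"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/status"
)

// chaosInterceptor injects delays and error codes into outgoing unary RPCs.
func chaosInterceptor(
	ctx context.Context,
	method string,
	req, reply interface{},
	cc *grpc.ClientConn,
	invoker grpc.UnaryInvoker,
	opts ...grpc.CallOption,
) error {
	switch r := rand.Float64(); {
	case r < 0.1:
		// Simulate an unreachable peer (network partition or remote crash).
		return status.Error(codes.Unavailable, "injected: peer unreachable")
	case r < 0.2:
		// Simulate a slow or throttled network link.
		time.Sleep(time.Duration(rand.Intn(500)) * time.Millisecond)
	}
	return invoker(ctx, method, req, reply, cc, opts...)
}

func main() {
	// Any client built on this connection sees the injected faults on
	// every unary call it makes.
	conn, err := grpc.NewClient("localhost:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithUnaryInterceptor(chaosInterceptor),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
}
```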
Regarding memory faults, the Spanner team focuses on two specific behaviors: simulating a pushback state, whereby a server becomes overloaded and clients start redirecting their requests to less busy replicas, and leaking enough memory that the process is killed. Similarly, they simulate "quota exceeded" errors, whether they relate to disk space, memory, or flash storage per user.
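The leak variant is conceptually simple, as the following sketch shows: the injector keeps retaining allocations until the process hits its memory limit and is terminated (by a container limit or the kernel's OOM killer), exercising the same recovery path as any other crash. The 64 MiB step and one-second interval are arbitrary choices for illustration.

```go
package main

import (
	"log"
	"time"
)

// leakUntilKilled retains ever-growing allocations so the process is
// eventually terminated by its memory limit.
func leakUntilKilled() {
	var hoard [][]byte
	for {
		hoard = append(hoard, make([]byte, 64<<20)) // retain 64 MiB per tick
		log.Printf("leaked %d MiB so far", len(hoard)*64)
		time.Sleep(time.Second)
	}
}

func main() {
	leakUntilKilled()
}
```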
Injecting Cloud faults aims to test abnormal situations affecting the Spanner API Front End Servers: one of them is crashed to force client sessions to migrate to other front ends, verifying that clients see no impact beyond some additional latency.
Finally, Google engineers also simulate an entire region becoming unreachable, due to several possible causes such as a file system or network outage, forcing Spanner to serve data from a quorum of replicas in the remaining regions as permitted by the Paxos algorithm.
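The reasoning behind that last test fits in a few lines: Paxos only needs a strict majority of replicas to make progress, so losing one region out of, say, five (an illustrative layout, not necessarily Spanner's) leaves the group able to serve writes. A minimal sketch:

```go
package main

import "fmt"

// quorumAvailable reports whether a Paxos group can still commit:
// a strict majority of its replicas must be reachable.
func quorumAvailable(reachable, total int) bool {
	return reachable > total/2
}

func main() {
	const replicas = 5 // e.g., one voting replica in each of five regions

	// One region becomes unreachable: 4 of 5 replicas still form a quorum.
	fmt.Println(quorumAvailable(replicas-1, replicas)) // true

	// Losing three regions breaks the quorum and halts writes for the group.
	fmt.Println(quorumAvailable(replicas-3, replicas)) // false
}
```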
Thanks to this approach, coupling a fault-tolerant design with continuous chaos testing, Google can efficiently validate Spanner's reliability, Corbett concludes.