Uber migrated all its payment transaction data from DynamoDB and blob storage into a new long-term solution, a purpose-built data store named LedgerStore. The company was looking for cost savings and had previously reduced the use of DynamoDB to store hot data (12 weeks old). The move resulted in significant savings and simplified the storage architecture.
Uber built Gulfstream, its payment platform, in 2017 and used DynamoDB for storage. Due to rising storage costs, DynamoDB was used only for the most recent data (12 weeks), and older data was stored in TerraBlob, an S3-like service created in-house by Uber.
In the meantime, the company started working on a dedicated solution for storing financial transactions with data integrity guarantees. Kaushik Devarajaiah, tech lead at Uber, explains the unique challenges of creating a bespoke data store:
LedgerStore is an immutable storage solution at Uber that provides verifiable data completeness and correctness guarantees to ensure data integrity for these transactions. [...] Considering that ledgers are the source of truth of any financial event or data movement at Uber, it is important to be able to look up ledgers from various access patterns via indexes. This brings in the need for trillions of indexes to index hundreds of billions of ledgers.
LedgerStore supports strongly and eventually consistent indexes. For strongly consistent indexes, the data store uses a two-phase commit. It first persists the indent on the index and subsequently persists the record. Lastly, if the record write succeeds, the intent is asynchronously committed or rolled back in case of failure. If the intent write fails, the whole insert operation fails, as consistency guarantees cannot be supported. During reads, any writes with uncommitted intents are either committed (if the record read succeeds) or deleted (if the record read fails) asynchronously. LedgerStore implements eventually consistent indexes by leveraging materialized views from its home-grown Docstore database, a distributed database built on top of MySQL.
Two-phase Commit Write For Strongly Consistent Indexes (Source: Uber Engineering Blog)
To offload older ledger data to cold storage, LedgerStore uses time-range indexes to support temporal queries. Uber moved from using DynamoDB and Docstore for storing time-range indexes. The original solution utilized two tables in DynamoDB, one optimized for writes and the second for reads. This design was dictated by DynamoDB’s capacity management and avoided hot partitions and throttling. The new design uses the Docstore database with a single table, and leverages prefix scanning for efficient reads.
LedgerStore supports index lifecycle management, automating data reindexing in case the index definition changes. The process creates a new index, backfills the data from the old index, performs relevant validations, swaps indexes, and deletes the old one.
Index Backfilling From Cold Storage (Source: Uber Engineering Blog)
The company faced unique challenges while migrating petabytes of financial transaction data into LedgerStore. It used shadow and offline validation to ensure the correctness of the migration and the performance and scalability of LedgerStore in the production environment. For shadow validation, Gulfstorm was double-writing the data to DynamoDB and LedgerStore and comparing the data returned by reads between the two data stores.
Additionally, Uber implemented offline validation for historical data, coupled with the incremental backfill job running in Apache Spark. The backfill process alone posed significant problems, as the load generated by the process amounted to ten times the usual production load, and the overall process took three months. Engineers used several measures to control the process and mitigate any issues, including dynamic rate control to adjust the processing rate based on the production traffic handled by the platform and emergency stop to halt the process quickly in case of major issues.
Lastly, the team took a conservative approach towards the rollout and adopted a fallback to fetching the data from DynamoDB in case it wasn’t found in LedgerStore. The overall migration was completed successfully, and the company didn’t experience any downtime or outages during and after the migration. The move to LedgerStore resulted in significant cost savings, with estimated yearly savings of over $6 million.
UPDATE: Since this news was published, it has sparked a debate on Hacker News and Reddit, with hundreds of comments posted. Many users discussed the pros and cons of using cloud provider services vs. building solutions, including databases in-house. Some users questioned whether rolling out a custom data store offers a justifiable return on investment, considering staff salaries, maintenance overheads, and the effort spent on data migrations.