Key Takeaways
- Magic Pocket is a horizontally scalable, exabyte-scale blob storage system that maintains 99.99% availability and extremely high durability.
- The system can run on any HDD, but primarily runs on Shingled Magnetic Recording (SMR) disks. It handles millions of queries per second and automatically identifies and repairs hundreds of hardware failures per day.
- Each storage device can contain 100+ drives and stores multiple petabytes of data.
- Using erasure codes and other optimizations, Dropbox reduces replication costs while maintaining durability similar to full replication.
- Forecasting is crucial for managing storage growth and for handling capacity issues such as supply chain disruptions.
At QCon San Francisco, I explained how the exabyte-scale blob storage system that stores all of Dropbox's customer data works. At its core, Magic Pocket is a very large key-value store where the values are arbitrarily sized blobs.
Our system has over 12 nines of durability and 99.99% availability, and we operate across three geographical regions in North America. Our systems are optimized for 4 MB blobs, immutable writes, and cold data.
Magic Pocket manages tens of millions of requests per second and a lot of the traffic comes from verifiers and background migrations. We have more than 600,000 storage drives currently deployed and we run thousands of compute machines.
Object Storage Device
The main building block of Magic Pocket is the Object Storage Device (OSD). Each of these devices has over 2 PB of capacity, is made up of around 100 disks per storage machine, and uses Shingled Magnetic Recording (SMR) technology.
SMR differs from Conventional Magnetic Recording drives as it performs sequential writes instead of random writes, allowing increased density.
The tradeoff of using SMR is that the write head overlaps and erases the adjacent track as it passes, so data cannot be rewritten in place with random writes.
However, this is perfect for our workload patterns. SMR drives also have a conventional zone that allows for caching of random writes if necessary, which typically accounts for less than 1% of the total capacity of the drive.
Figure 1: SMR Track Layout
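To make the write pattern concrete, here is a minimal sketch, in Python, of how a drive with a small conventional zone might absorb random writes and flush them sequentially into the shingled zones. This is not Dropbox's OSD code; the `SMRDrive` class, its capacity threshold, and the flush policy are all illustrative assumptions.

```python
class SMRDrive:
    """Toy model of an SMR drive: shingled zones accept only sequential
    appends, while a small conventional zone buffers random writes."""

    def __init__(self, conventional_capacity: int = 4):
        self.shingled_log = []          # append-only stand-in for shingled zones
        self.conventional_buffer = {}   # small random-write cache (<1% of capacity)
        self.conventional_capacity = conventional_capacity

    def write(self, key: str, blob: bytes) -> None:
        # A random write lands in the conventional zone first...
        self.conventional_buffer[key] = blob
        if len(self.conventional_buffer) >= self.conventional_capacity:
            self.flush()

    def flush(self) -> None:
        # ...and is later flushed as one sequential append, which is the
        # access pattern shingled tracks require.
        for key, blob in self.conventional_buffer.items():
            self.shingled_log.append((key, blob))
        self.conventional_buffer.clear()


drive = SMRDrive()
for i in range(8):
    drive.write(f"hash-{i}", b"\x00" * 16)
print(len(drive.shingled_log))  # 8 entries, written as two sequential batches
```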
At a high level, the architecture of Magic Pocket consists of three zones: West Coast, Central, and East Coast. The system is built around pockets, which represent logical versions of everything in the system. Magic Pocket can have multiple instances, such as a test pocket or a stage pocket before the production one. Databases and compute are not shared between pockets, and operate independently of each other.
Zone
These are the different components of the Magic Pocket architecture in each zone.
Figure 2: How a zone works
The first component is the frontend, the service that interacts with clients. Clients typically make PUT requests with keys and blobs, GET requests, delete calls, or perform scans for available hashes in the system.
When a GET request is made, the hash index, a collection of sharded MySQL databases, is queried. The hash index is sharded by the hash, which is the key for a blob, and each hash is mapped to a cell and a bucket, along with a checksum. The cells are the isolation units where all the storage devices are located: a cell can hold over 100 PB but only grows up to a specific size limit. When the system runs low on capacity, a new cell is opened up, allowing the system to scale horizontally.
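As an illustration of the lookup path just described, here is a minimal sketch of a hash index sharded by the blob's hash, mapping each hash to a cell, a bucket, and a checksum. The shard count, the in-memory dictionaries standing in for the MySQL shards, and the helper names are assumptions, not Magic Pocket's actual schema.

```python
import hashlib

NUM_SHARDS = 256  # hypothetical shard count

def blob_hash(blob: bytes) -> str:
    # The key for a blob is a hash of its content.
    return hashlib.sha256(blob).hexdigest()

def shard_for(hash_key: str) -> int:
    # The hash index is sharded by the hash itself.
    return int(hash_key, 16) % NUM_SHARDS

# Each shard stands in for a MySQL database mapping hash -> (cell, bucket, checksum).
hash_index = {shard: {} for shard in range(NUM_SHARDS)}

def put_index_entry(hash_key: str, cell: str, bucket: int, checksum: str) -> None:
    hash_index[shard_for(hash_key)][hash_key] = (cell, bucket, checksum)

def get_index_entry(hash_key: str):
    return hash_index[shard_for(hash_key)].get(hash_key)

blob = b"example blob payload"
key = blob_hash(blob)
put_index_entry(key, cell="cell-7", bucket=42, checksum=key)
print(get_index_entry(key))  # ('cell-7', 42, '<sha256 hex>')
```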
The cross-zone replicator is the component performing cross-zone replication, storing data in multiple regions. The operation is done asynchronously, and once a commit happens in the primary region, the data is queued up for replication to another zone. The control plane manages traffic coordination, generates migration plans, and handles machine re-installations. It also manages cell state information.
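The asynchronous pattern can be sketched as follows: a commit in the primary zone enqueues the blob, and a background worker later drains the queue into another zone. The queue and the zone dictionaries are illustrative stand-ins for the real services.

```python
from collections import deque

replication_queue = deque()   # stand-in for the cross-zone replicator's queue
primary_zone = {}
remote_zone = {}

def commit(hash_key: str, blob: bytes) -> None:
    # The client-facing write completes against the primary zone...
    primary_zone[hash_key] = blob
    # ...and the blob is queued for asynchronous cross-zone replication.
    replication_queue.append(hash_key)

def replicate_once() -> None:
    # A background worker drains the queue independently of live traffic.
    while replication_queue:
        hash_key = replication_queue.popleft()
        remote_zone[hash_key] = primary_zone[hash_key]

commit("abc123", b"payload")
replicate_once()
assert remote_zone["abc123"] == b"payload"
```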
Cell
If I want to fetch a blob, I need to access the bucket service that knows about buckets and volumes: when I ask for a bucket, my request is mapped to a volume and the volume is mapped to a set of OSDs.
Figure 3: How a cell works
Once we find the OSD that has our blob, we can retrieve it. For writing data, the frontend service figures out which buckets are open for writing and it commits to the ones that are ready. The buckets are pre-created for us, and the data is stored in a set of OSDs within a volume.
The coordinator is an important component in the cell, managing all the buckets, volumes, and storage machines. The coordinator constantly checks the health of the storage machines, reconciles information with the bucket service and database, and performs erasure coding and repairs: it optimizes data placement by moving things around within the cell and moves data to other machines when necessary. The volume manager handles the reading, writing, repairing, and erasure encoding of volumes. Verification steps happen both within and outside of the cell.
Buckets, Volumes, and Extents
We can now dive deeper into the components of the Magic Pocket storage system, namely buckets, volumes, and extents.
Figure 4: Buckets, volumes, and extents
A bucket is a logical storage unit associated with a volume and an extent, which represents 1-2 GB of data on a disk. When we write, we identify the open buckets and the associated OSDs and then write to the extents. The coordinator manages the bucket, volume, and extent information, and it ensures that data is not lost by finding a new placement for an extent when it is lost. A volume is composed of one or more buckets; it is either replicated or erasure-coded, and it is either open or closed. Once a volume is closed, it is never opened up again.
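The relationships above can be summarized with a hedged data-model sketch; the field names are illustrative, not the actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Extent:
    osd_id: str        # the storage machine holding this 1-2 GB of data on disk
    offset: int = 0

@dataclass
class Bucket:
    bucket_id: int
    extents: List[Extent] = field(default_factory=list)  # copies on a set of OSDs

@dataclass
class Volume:
    volume_id: int
    buckets: List[Bucket] = field(default_factory=list)  # one or more buckets
    erasure_coded: bool = False   # replicated until closed and encoded
    open: bool = True             # once closed, a volume is never reopened

volume = Volume(volume_id=1,
                buckets=[Bucket(bucket_id=7,
                                extents=[Extent("osd-1"), Extent("osd-2")])])
```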
How to Find a Blob in Object Storage Devices
In this section, we learn how to find a blob in a storage machine. To do this, we store the addresses of the OSDs that hold the blob, and we talk directly to those OSDs.
Figure 5: Finding a Blob
The OSDs load up all the extent information and build an in-memory index mapping the hashes they hold to their disk offsets. If we want to fetch a blob, we need to know the volume and which OSDs have the blob. For a PUT, it's the same process, but we write to every single OSD in parallel and do not return until the write has completed on all storage machines. As the volume is 4x replicated, a full copy is available on all four OSDs.
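A minimal sketch of both ideas, assuming toy in-process classes rather than real storage machines: each OSD builds an in-memory map from hash to disk offset, and a replicated PUT only returns once every OSD in the 4x-replicated volume has acknowledged the write.

```python
from concurrent.futures import ThreadPoolExecutor

class OSD:
    def __init__(self):
        self.disk = bytearray()
        self.index = {}        # hash -> (offset, length), built from loaded extents

    def put(self, hash_key: str, blob: bytes) -> None:
        self.index[hash_key] = (len(self.disk), len(blob))
        self.disk.extend(blob)         # data is appended sequentially

    def get(self, hash_key: str) -> bytes:
        offset, length = self.index[hash_key]
        return bytes(self.disk[offset:offset + length])

def replicated_put(osds, hash_key: str, blob: bytes) -> None:
    # Write to every OSD in the volume in parallel; only return when all are done.
    with ThreadPoolExecutor(max_workers=len(osds)) as pool:
        futures = [pool.submit(osd.put, hash_key, blob) for osd in osds]
        for future in futures:
            future.result()            # propagate any failure

volume_osds = [OSD() for _ in range(4)]   # a 4x-replicated volume
replicated_put(volume_osds, "abc123", b"the 4 MB blob would go here")
print(volume_osds[2].get("abc123"))       # any of the four copies can serve a GET
```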
Erasure Coding
Failures happen all the time, and keeping four copies in each of two zones is costly. Let's look at the difference between a replicated volume and an erasure-coded volume, and how failures are handled in each.
Figure 6: Erasure Coding
Erasure coding is a way to reduce replication costs while maintaining durability similar to replication. In our system, when a volume is almost full, it is closed and becomes eligible for erasure coding. We use an erasure code such as Reed-Solomon (6, 3), with 6 data OSDs and 3 parities in a volume group. A blob lives in a single data extent, and if one OSD fails, the blob can be reconstructed from the others. Reconstruction can happen on live requests for data or in the background as part of repairs. There are many variations of erasure codes with different tradeoffs around overhead: for example, using XOR as an erasure code is simple, but custom erasure codes can be a better fit.
Figure 7: Failure and erasure coding
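The simple XOR variant mentioned above can be demonstrated in a few lines: a single XOR parity over a group of data extents tolerates the loss of any one extent. This is only the toy code named in the text, not the Reed-Solomon or custom codes Dropbox runs in production.

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# Six data extents plus one XOR parity (a toy 7/6 overhead code).
data_extents = [bytes([i]) * 8 for i in range(6)]
parity = xor_blocks(data_extents)

# Simulate losing extent 2 and rebuilding it from the survivors plus the parity.
survivors = data_extents[:2] + data_extents[3:]
reconstructed = xor_blocks(survivors + [parity])
assert reconstructed == data_extents[2]
```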
The paper "Erasure Coding in Windows Azure Storage" by Huang and others is a useful resource on the topic, and we use similar techniques within our system.
Figure 8: Reed Solomon error correction by "Erasure Coding in Windows Azure Storage" by Huang et al
I previously mentioned the example of a Reed-Solomon (6, 3) code with 6 data extents and 3 parities. Another option is local reconstruction codes (LRC), which optimize read cost. A Reed-Solomon (6, 3) code incurs a read penalty of 6 reads whenever there is a failure. With local reconstruction codes, you get the same read cost for a single data failure but with a lower storage overhead of roughly 1.33x, compared to Reed-Solomon's 1.5x. Although this may not seem like a huge difference, it means significant savings at a larger scale.
Figure 9: Reconstruction code comparisons from "Erasure Coding in Windows Azure Storage" by Huang et al
The local reconstruction codes optimize for one failure within the group, which is usually what you encounter in production. Making this tradeoff is acceptable because more than 2 failures in a volume group are rare.
Even lower replication factors are possible with these codes: the LRC-(12, 2, 2) code can tolerate any three failures within the group, and some, but not all, combinations of four failures.
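To illustrate why the read penalty stays low, here is a hedged sketch of the "local" half of an LRC-style layout: twelve data extents in two groups of six, each protected by an XOR local parity, so a single lost extent is rebuilt from six reads within its group rather than from the whole stripe. The two global parities of a real LRC-(12, 2, 2), which are Reed-Solomon-style in the Huang et al. construction, are omitted here.

```python
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# Twelve data extents in two local groups of six (an LRC-(12, 2, 2)-style layout).
data = [bytes([i]) * 8 for i in range(12)]
groups = [data[:6], data[6:]]
local_parities = [xor_blocks(group) for group in groups]
# Two global parities would also be stored; they are omitted in this sketch.

# Lose data extent 8 (second group): rebuild it from the 5 surviving group
# members plus the group's local parity, i.e. 6 reads instead of 12.
lost = 8
group = groups[1]
survivors = group[:lost - 6] + group[lost - 6 + 1:]
reconstructed = xor_blocks(survivors + [local_parities[1]])
assert reconstructed == data[lost]

# Storage overhead: (12 data + 2 local + 2 global) / 12 = ~1.33x,
# versus 9 / 6 = 1.5x for Reed-Solomon (6, 3).
```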
The Cold Storage System
Can we do better than this for our system? We have observed that 90% of retrievals are for data uploaded in the last year, and 80% of retrievals happen within the first 100 days after upload, so we are exploring ways to improve our cross-zone replication.
Figure 10: File access distribution
Since we have a large amount of cold data that is not accessed frequently, we want to optimize for this workload, reducing reads while maintaining similar latency, durability, and availability. To achieve this, we observe that we do not have to do live writes into cold storage and can lower our replication factor below 2x by utilizing more than one region.
Let's look at how our cold storage system works; the inspiration comes from Facebook's warm blob storage system, f4. The f4 paper suggests splitting a blob into two halves and taking the XOR of those two halves, with the resulting fragments stored individually in different zones. To retrieve the full blob, any two of the three fragments (blob1, blob2, or the XOR) must be available, which means any two of the three regions suffice. However, to do a write, all regions need to be fully available. Note that because the migrations happen asynchronously in the background, they do not affect the live write path.
Figure 11: Splitting blobs and cold storage
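A minimal sketch of the f4-style scheme described above: split the blob into two halves, compute their XOR, place the three fragments in three zones, and reconstruct from any two. The zone names and helper functions are illustrative.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def split_for_cold_storage(blob: bytes) -> dict:
    # Pad to an even length so the two halves line up for the XOR.
    if len(blob) % 2:
        blob += b"\x00"
    half = len(blob) // 2
    blob1, blob2 = blob[:half], blob[half:]
    return {"west": blob1, "central": blob2, "east": xor_bytes(blob1, blob2)}

def reconstruct(fragments: dict) -> bytes:
    # Any two of the three fragments are sufficient.
    if "west" in fragments and "central" in fragments:
        return fragments["west"] + fragments["central"]
    if "west" in fragments:  # blob2 = blob1 XOR (blob1 XOR blob2)
        return fragments["west"] + xor_bytes(fragments["west"], fragments["east"])
    return xor_bytes(fragments["central"], fragments["east"]) + fragments["central"]

fragments = split_for_cold_storage(b"cold blob payload!")
del fragments["central"]          # pretend one zone is unavailable
print(reconstruct(fragments))     # the original blob (plus any padding byte)
```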
What are the benefits of this cold storage system? We have achieved a 25% savings by reducing the replication factor from 2x to 1.5x. The fragments stored in cold storage are still internally erasure-coded, and migration is done in the background. To reduce overhead on backbone bandwidth, we send requests to the two closest zones and only fetch from the remaining zone if necessary. This saves a significant amount of bandwidth as well.
Release Cycle
How do we do releases in Magic Pocket? Our release cycle takes around four weeks across all staging and production environments.
Figure 12: Magic Pocket’s release cycle
Before committing changes, we run a series of unit and integration tests against all dependencies, followed by a durability stage with a full verification of all data. Each zone's verifications take about a week per stage. The release cycle is fully automated, with checks in place that abort or hold back code changes if there are any alerts. Only in exceptional cases do we stop the automatic deployment process and take manual control.
Verifications
What about verifications? Within our system, we conduct a lot of verifications to ensure data accuracy.
Figure 13: Verifications
One of these is performed by the cross-zone verifier, which synchronizes data mappings between clients upstream and the system. Another is the index verifier, which scans the index table to confirm if specific blobs are present in each storage machine: we simply ask if the machine has the blob based on its loaded extents, without actually fetching the content. The watcher is another component that performs full validation of the blobs themselves, with sampling done after one minute, an hour, a day, and a week. We also have the trash inspector, which ensures that all hashes within an extent are deleted once the extent is deleted.
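As a concrete illustration of the index verifier's contract, here is a minimal sketch in which the verifier walks an index table and asks each storage machine whether it holds a hash, based only on the machine's loaded extent index, without fetching the blob content. The interfaces and names are assumptions.

```python
class OSDStub:
    """Stands in for a storage machine answering from its loaded extent index."""

    def __init__(self, hashes):
        self.loaded_hashes = set(hashes)

    def has_blob(self, hash_key: str) -> bool:
        # Answer from the in-memory index only; the content is never fetched.
        return hash_key in self.loaded_hashes

def index_verifier(index_table, osds):
    """Scan the index table and report hashes an OSD claims not to have."""
    missing = []
    for hash_key, osd_ids in index_table.items():
        for osd_id in osd_ids:
            if not osds[osd_id].has_blob(hash_key):
                missing.append((hash_key, osd_id))
    return missing

osds = {"osd-1": OSDStub({"aaa", "bbb"}), "osd-2": OSDStub({"aaa"})}
index_table = {"aaa": ["osd-1", "osd-2"], "bbb": ["osd-1", "osd-2"]}
print(index_verifier(index_table, osds))  # [('bbb', 'osd-2')] -> flag for repair
```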
Operations
With Magic Pocket, we deal with lots of migrations since we operate out of multiple data centers. We manage a very large fleet of storage machines, and it's important to know what's happening all the time. There's a lot of automated chaos taking place, so we run tons of disaster recovery events to test the reliability of our system: upgrading at this scale is just as difficult as building the system itself. Managing background traffic is one of our key operations, since it accounts for most of our traffic and disk IOPS. The disk scrubber constantly scans through all of the extents and checks their checksums. We categorize traffic by service into different tiers, and live traffic is prioritized by the network.
The control plane generates plans for a lot of the background traffic based on forecasts we have about a data center migration: we take into account the type of migration we are doing, such as for cold storage, and plan accordingly.
We deal with a lot of failures in our system: we have to repair 4 extents every second, which can be anywhere from 1 to 2 GBs in size. We have a pretty strict SLA on repairs (less than 48 hours) and, as it is part of our durability model, we want to keep this repair time as low as possible. Our OSDs get allocated into the system automatically based on the size of the cell and current utilization.
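As a rough back-of-the-envelope illustration rather than an official figure, repairing 4 extents per second at 1 to 2 GB each works out to roughly 4 to 8 GB/s of sustained repair traffic, which is why repair time and the 48-hour SLA are treated as part of the durability model.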
We also have a lot of migrations to different data centers.
Figure 14: Migrations
Two years ago, we migrated out of the SJC region, and it took extensive planning to make it happen. For very large migrations, like hundreds of PBs, there is significant preparation going on behind the scenes, and we give ourselves extra time to make sure that we can finish the migration in time.
Forecasting
Forecasting is a crucial part of managing our storage system at this scale. We are constantly dealing with the challenge of storage growth, which can sometimes be unexpected and require us to quickly adapt and absorb the new data into our system. Additionally, we may face capacity issues due to supply chain disruptions like those caused by the COVID pandemic: as soon as we identify any potential problems, we start working on backup plans as it takes a considerable amount of time to order and deliver new capacity to the data centers. Our forecasts are directly integrated into the control plane, which helps us execute migrations based on the information provided by our capacity teams.
Conclusion
In managing Magic Pocket, four key lessons have helped us maintain the system:
- Protect and verify
- Okay to move slowly at scale
- Keep things simple
- Prepare for the worst
First and foremost, we prioritize protecting and verifying our system. It requires a significant amount of overhead, but it's crucial to have end-to-end verification to ensure consistency and reliability.
At this scale, it's important to move slowly and steadily. We prioritize durability and take the time to wait for verifications before deploying anything new. We always consider the risks and prepare for worst-case scenarios.
Simplicity is also a crucial factor. We aim to keep things simple, especially during large-scale migrations, as too many optimizations can create a complicated mental model that makes planning and debugging difficult.
In addition, we always have a backup plan in case of failures or issues during migrations or deployments. We ensure that changes are not a one-way door and can be reversed if necessary. Overall, managing a storage system of this scale requires a careful balance of protection, verification, simplicity, and preparation.