Indestructible Storage in the Cloud with Apache BookKeeper

Key Takeaways

  • The traditional way to make storage systems cloud-aware is a lift-and-shift approach. While it works, in our experience, investing in refactoring the architecture to be more cloud-aware can yield better results.
  • Current workarounds for deploying Apache BookKeeper across zones require manually mapping BookKeeper storage nodes to specific regions/availability zones, and they don’t preserve the same durability and availability guarantees during zone outages.
  • Salesforce’s unique approach to making Apache BookKeeper cloud-aware adds intelligence to the BookKeeper storage nodes so that they function effectively in a cloud-based deployment while maintaining the durability and availability BookKeeper guarantees.
  • These additions make it easier to patch, upgrade, and restart clusters with minimal impact on consuming services.

At Salesforce, we needed a storage system that could serve two kinds of streams: one for write-ahead logs and one for data. The two streams have competing requirements. The write-ahead log stream needs low write latency and high read throughput, while the data stream needs high write throughput but low random-read latency. As pioneers in cloud computing, we also needed the storage system to be cloud-aware, since our availability and durability requirements keep increasing. Our deployment models are designed to be immutable so that we can scale to massive levels and run on commodity hardware.

The open-source alternative

Our initial research for such a storage system confronted us with the question of build versus buy.

We decided to go with an open-source storage system, given that our expectations and requirements hinge on the main business drivers: time to market, resources, cost, and so on.

After researching what open source had to offer, we settled on two finalists: Ceph and Apache BookKeeper. Because the system had to be available to our customers, scale to massive levels, and remain consistent as a source of truth, we needed to ensure it could satisfy the relevant aspects of the CAP theorem for our use case. Let’s take a bird’s-eye view of where BookKeeper and Ceph stand with regard to the CAP theorem (Consistency, Availability, and Partition tolerance) and our unique requirements.

While Ceph provides consistency and partition tolerance, its read path can offer availability and partition tolerance only with unreliable reads, and considerable work would still be required to make the write path provide availability and partition tolerance. We also had to keep our immutable-data requirement for deployments in mind.

We determined Apache BookKeeper to be the clear choice for our use case. It comes close to being the CAP system we require thanks to its append-only, immutable data store design and its highly replicated distributed log. Other key features:

  • Once a write is ACK’d, it’s always readable.
  • Once an entry is read, it’s always readable.
  • No central master; a thick client uses Apache ZooKeeper for consensus and metadata.
  • No complicated hashing/computing for data placement.
     

Furthermore, Salesforce has always encouraged open source, and Apache BookKeeper has an active and vibrant community.

Apache BookKeeper - Almost perfect but more work required

While Apache BookKeeper has checked most of the boxes for our requirements, there are still gaps. Before we get into the gaps, let's understand what Apache BookKeeper provides.

  • Every storage node is called a bookie. An ‘Ensemble’ is a set of bookies.
  • An ‘Entry’ is the smallest unit that you can write. It is also immutable.
  • A ‘Ledger’ is made up of a stream of entries which also follows an immutable model of append only.
  • The ‘write quorum’ is the number of bookies data is written to or replicated on - the maximum replication for the entry. 
  • The ‘ack quorum’ is the number of bookies on which the data is acknowledged before confirming the write - the minimum replication for the entry.
     

From a durability standpoint, ledgers are replicated across an ensemble of bookies, and entries within the ledgers are also striped across the ensemble.

For example: ensemble size 5, write quorum size 3, ack quorum size 2.

Writes are confirmed based on configurable write and ack quorums. This provides low write latencies and the ability to scale.
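
For concreteness, here is a minimal sketch of writing with exactly these settings through the standard BookKeeper client API; the ZooKeeper connect string and ledger password are placeholders.

    import org.apache.bookkeeper.client.BookKeeper;
    import org.apache.bookkeeper.client.BookKeeper.DigestType;
    import org.apache.bookkeeper.client.LedgerHandle;

    public class QuorumExample {
        public static void main(String[] args) throws Exception {
            // Connect to the cluster through its ZooKeeper metadata service.
            BookKeeper bk = new BookKeeper("zk1:2181");

            // Ensemble size 5, write quorum 3, ack quorum 2: each entry is striped
            // onto 3 of the 5 bookies, and the write is confirmed as soon as 2 of
            // those 3 bookies have acknowledged it.
            LedgerHandle lh = bk.createLedger(5, 3, 2,
                    DigestType.CRC32, "ledger-passwd".getBytes());

            long entryId = lh.addEntry("hello, bookkeeper".getBytes());
            System.out.println("wrote entry " + entryId + " to ledger " + lh.getId());

            lh.close();
            bk.close();
        }
    }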

However, it proved challenging to run BookKeeper on commodity hardware in the cloud.

The data placement policies are not natively cloud-aware and do not take the underlying public cloud provider (the cloud substrate) into consideration. Some users currently deploy BookKeeper by manually identifying nodes in different availability zones, logically grouping them together, and retrofitting an existing placement policy onto those node groups. While this work-around functions, it provides no support for zone failures, and it does not make maintaining and upgrading massive clusters any easier.

All of the cloud substrates have notoriously seen downtime across their availability zones from time to time, and the general understanding is that applications must be designed for these faults. A great example is Netflix during Christmas 2012, when it was affected by an Amazon Web Services availability zone outage. The Netflix service kept running at limited capacity even though the underlying public cloud infrastructure it depended on was down.

Problems in the public cloud

The internet, from websites to apps to enterprise software, mostly runs on infrastructure provided by public cloud providers, because public cloud infrastructure is easily scalable and, to an extent, less expensive to use and maintain. However, it is not without its faults, one of which is unavailability at the node, zone, or region level. If the underlying infrastructure is unavailable, there is little its users can do. The cause may be an outage in certain machines, zones, or regions, or even rising network latencies due to faulty hardware. So in the end, as developers of applications running on top of this public cloud infrastructure, the onus is on us to design with its faults in mind.

Apache BookKeeper doesn’t have a built-in answer for this possibility, so we needed to design a fix.

The Salesforce re-architecture

With a general understanding of the problems, we started to think about how to fix them in order to make Apache BookKeeper cloud-aware and match our requirements. We boiled the gaps down as follows:

  • Bookies need an identity in their clusters within the public cloud.
  • Data placement policies need to be designed for ensembles spread according to availability zones in order to provide better availability, easier maintenance, and smoother deployment.
  • Existing bookie functions such as reads, writes, and data replication need to be changed in order to take advantage of a multi-zone layout and to account for the cost of data transfer across those zones.
  • All of this should be cloud substrate agnostic.
     

Let's take a look at how we addressed these gaps.

Cookies and Kubernetes for cloud awareness

The existing BookKeeper architecture provides each bookie with a unique identity, called a cookie, that is assigned at the time of its first boot-up. The cookie is persisted in the metadata store (ZooKeeper) and is available for access by other bookies and by clients.

The first step to making Apache BookKeeper cloud-aware is to make each bookie aware of its place in a cluster deployed on Kubernetes. We decided that the cookie would be the best place for this information.

We added a field called ‘networkLocation’ to the cookie, consisting of the two main components that locate a bookie: the availability zone and the upgrade domain. Working with Kubernetes allows us to be substrate agnostic, so we use the Kubernetes APIs to query the underlying availability zone information. We also generate the ‘upgradeDomain’ from a formula based on the hostname ordinal index. The upgrade domain can be used for rolling upgrades without affecting the availability of the cluster.

All of this information is generated at boot up and persisted in the metadata store for access.
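
As an illustration only (our actual implementation is not shown here), a bookie could derive its networkLocation at boot along the following lines, assuming the availability zone is exposed to the pod as an AVAILABILITY_ZONE environment variable (for example via the Kubernetes downward API) and bookies run as StatefulSet pods with ordinal hostnames such as bookie-7.

    // Hypothetical sketch of deriving a bookie's networkLocation at startup.
    public final class NetworkLocationResolver {

        private static final int NUM_UPGRADE_DOMAINS = 3; // illustrative value

        public static String resolve() {
            // Availability zone, assumed to be injected into the pod's environment.
            String zone = System.getenv().getOrDefault("AVAILABILITY_ZONE", "unknown-zone");

            // StatefulSet hostnames end with an ordinal index, e.g. "bookie-7" -> 7.
            String hostname = System.getenv().getOrDefault("HOSTNAME", "bookie-0");
            int ordinal = Integer.parseInt(hostname.substring(hostname.lastIndexOf('-') + 1));

            // One simple formula: spread consecutive ordinals across upgrade domains.
            int upgradeDomain = ordinal % NUM_UPGRADE_DOMAINS;

            // networkLocation ends up as "/<zone>/<upgradeDomain>", e.g. "/us-east-1a/ud-1".
            return "/" + zone + "/ud-" + upgradeDomain;
        }
    }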

This information can now be used when generating ensembles, assigning bookies to ensembles, and deciding which bookies to replicate from or to replicate to.

Public cloud placement policies

Now that the client has the intelligence to communicate with bookies in particular zones, we need a data placement policy that makes use of this information. One of our unique developments is the ZoneAwareEnsemblePlacementPolicy (ZEPP), a two-level hierarchical placement policy designed specifically for cloud-based deployments. ZEPP is aware of availability zones (AZs) and upgrade domains (UDs).

An AZ is a logically isolated data center within a region. A UD is a set of nodes within an AZ that can be brought down together with no impact on the service. ZEPP also provides heuristics to understand when a zone is down and when it is back up.

Below is a visualization of a deployment which ZEPP can use. This deployment considers the AZ and UD information from the cookies to group the bookie nodes as shown.
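
For reference, a client can opt into the zone-aware policy through its configuration. The sketch below assumes the class name ZoneawareEnsemblePlacementPolicy as it appears upstream and a placeholder ZooKeeper connect string; exact class names and setters can vary by BookKeeper release.

    import org.apache.bookkeeper.client.BookKeeper;
    import org.apache.bookkeeper.client.ZoneawareEnsemblePlacementPolicy;
    import org.apache.bookkeeper.conf.ClientConfiguration;

    public class ZeppClientExample {
        public static void main(String[] args) throws Exception {
            ClientConfiguration conf = new ClientConfiguration();
            conf.setZkServers("zk1:2181"); // placeholder connect string

            // Use the zone-aware placement policy instead of the default rack-aware
            // one, so that ensembles are spread across availability zones.
            conf.setEnsemblePlacementPolicy(ZoneawareEnsemblePlacementPolicy.class);

            BookKeeper bk = new BookKeeper(conf);
            // ... create ledgers as usual; ensembles are now placed per ZEPP ...
            bk.close();
        }
    }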

Availability vs Latency vs Cost

With these changes, we’ve been able to make Apache BookKeeper truly cloud-aware. However, it’s also important to consider costs when designing this architecture. Most cloud substrates charge for data transfer out of their service, and the charges differ when the transfer crosses availability zones. This is an important factor for the BookKeeper client, since it currently picks an arbitrary bookie from an ensemble to satisfy a read.

If that bookie is in a different availability zone than the client, the read incurs unnecessary charges. Data replication also now occurs between bookies spread across availability zones, which adds further cost in scenarios where an availability zone goes down.

Let’s take a look at how we handled these specific scenarios.

Reordered Reads

Currently the BookKeeper client picks an arbitrary bookie from an ensemble to satisfy a read.

With our reordered reads feature, the client now picks a bookie such that it minimizes read latencies while reducing costs.

With the reordered reads feature switched on, the client picks bookies in the following order:

  • From the local zone, in order of bookies that can satisfy the request and have fewer pending requests.
  • From a remote zone, in order of bookies that can satisfy the request and have fewer pending requests.
  • The next bookie from the local zone with the least failures or with pending requests above a configured threshold.
  • The next bookie from a remote zone with the least failures or with pending requests above a configured threshold.

This ordering gives us a good trade-off between latency and cost, even in a system that has been running for a long time and has experienced faults.
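
The sketch below is purely illustrative, not the actual client code; it shows one way such an ordering could be expressed, using hypothetical tiers for healthy local, healthy remote, and degraded bookies.

    import java.util.Comparator;
    import java.util.List;

    // Illustrative read-ordering sketch: prefer healthy, lightly loaded bookies in
    // the client's own zone, then healthy remote bookies, then degraded bookies.
    final class ReadOrderSketch {

        record BookieView(String address, String zone, long pendingRequests,
                          long recentFailures, boolean canServeRead) {}

        static List<BookieView> order(List<BookieView> ensemble, String localZone,
                                      long pendingThreshold) {
            Comparator<BookieView> byPreference = Comparator
                    // 0 = healthy local, 1 = healthy remote, 2 = degraded local, 3 = degraded remote
                    .comparingInt((BookieView b) -> tier(b, localZone, pendingThreshold))
                    // within a tier, prefer fewer pending requests, then fewer failures
                    .thenComparingLong(BookieView::pendingRequests)
                    .thenComparingLong(BookieView::recentFailures);
            return ensemble.stream()
                    .filter(BookieView::canServeRead)
                    .sorted(byPreference)
                    .toList();
        }

        private static int tier(BookieView b, String localZone, long pendingThreshold) {
            boolean local = b.zone().equals(localZone);
            boolean degraded = b.pendingRequests() > pendingThreshold || b.recentFailures() > 0;
            if (!degraded) {
                return local ? 0 : 1;
            }
            return local ? 2 : 3;
        }
    }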

Handling a zone failure

When a zone goes down, all bookies from the affected ensembles would start replicating their data to bookies in the currently available zones to satisfy ensemble size constraints and quorum requirements, causing a ‘thundering herd’ problem.

Our approach was to first decide when a zone is actually down. Failures can be transient blips; we don’t want to start replicating terabytes of data just because a network blip made a zone appear unavailable. At the same time, we want to be ready when there is a real failure.

Our solution has two phases:

  • Identify when a zone is really down vs temporary blips.
  • Move large scale auto-replication of an entire zone to a manual operation.
     

The diagram below shows the heuristics we consider when declaring a zone down versus back up.

HighWaterMark and LowWaterMark are two values computed from the number of bookies available in a zone versus the total number of bookies in that zone. The thresholds for both values are configurable, so users can decide how sensitive the detection is with respect to what they deem to be a failure.
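
A minimal sketch of that hysteresis, with illustrative threshold values rather than our production configuration, might look like this:

    // Illustrative sketch of high/low watermark hysteresis for zone health.
    final class ZoneHealthTracker {

        private final double lowWaterMark;   // e.g. 0.5: mark the zone down below this fraction
        private final double highWaterMark;  // e.g. 0.8: mark it back up above this fraction
        private boolean zoneDown = false;

        ZoneHealthTracker(double lowWaterMark, double highWaterMark) {
            this.lowWaterMark = lowWaterMark;
            this.highWaterMark = highWaterMark;
        }

        // Feed in the latest view of the zone; returns true if the zone is considered down.
        synchronized boolean update(int availableBookies, int totalBookies) {
            double available = (double) availableBookies / totalBookies;
            if (!zoneDown && available < lowWaterMark) {
                zoneDown = true;   // crossed the low watermark: declare the zone down
            } else if (zoneDown && available >= highWaterMark) {
                zoneDown = false;  // recovered past the high watermark: declare it back up
            }
            return zoneDown;
        }
    }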

We also disable auto-replication when a zone is marked as down, to avoid automatically replicating terabytes of data across zones. In its place, we added alerts that notify the user of a possible zone failure. We believe an operations expert can differentiate noise from a real failure and make the call on whether to start auto-replication of an entire zone.

We also provide bookie shell commands to kick-start the auto-replication that was disabled.

What we learned

Apache BookKeeper is a very active open source project and has an amazingly supportive community with vibrant discussions for all of the challenges faced by the system. As it is a source of truth for many of its users, it needs to become cloud-aware.

However, such an architectural change comes with trade-offs and deciding factors at every level - availability, latency, costs, ease of deployment, and maintenance. The above considerations and changes have been battle-tested at Salesforce, and we are now able to support AZ and AZ+1 failure using Apache BookKeeper. We have already merged a few of our changes and will continue to contribute more to the community in the coming releases. These additions aim to make it easier to patch, upgrade, and restart clusters with minimal impact on consuming services.

About the Author

Anup Ghatage is a software developer based in San Francisco.
He enjoys challenging problems in maintaining and developing highly scalable systems. He’s currently a Lead Member of Technical Staff at Salesforce, where he works on cloud infrastructure and data engineering. He’s also a committer and an active contributor to Apache BookKeeper. He previously worked at SAP and Cisco Systems, and holds a bachelor’s degree in computer science from the University of Pune and a master’s degree from Carnegie Mellon University. You can reach Anup on Twitter at @ghatageanup.
