Yelp reworked its data streaming architecture using Apache Beam and Apache Flink. The company replaced a fragmented set of data pipelines that streamed transactional data into its analytical systems, such as Amazon Redshift and its in-house data lake, with a unified and flexible solution built on these Apache stream-processing projects.
Yelp manages the properties of business entities, one of the primary data entities in its platform, in two different online systems. The legacy part of the platform stores business properties in a MySQL database, whereas a newer part that adopted a microservices architecture uses Cassandra for storage.
Historically, streaming data from the online databases to the offline (analytical) ones relied on a separate data pipeline for each of the two systems that manage business properties. The solution used MySQL Replication Handler to push data from the legacy system and Cassandra Source Connector to push data from the new system. In both cases, updates were published to Apache Kafka, and the Redshift Connector synchronized the data to the corresponding Redshift tables.
Previous Streaming Architecture for Business Properties (Source: Yelp Engineering Blog)
The original solution, with separate pipelines streaming data from online databases to analytical data stores, exhibited weak encapsulation: tables in the offline (analytical) data stores exactly mirrored the corresponding tables in the online databases, exposing data analytics teams to data discrepancies and data accuracy issues. Additionally, analytical processes had to collect data from multiple tables and normalize it into a consistent format. Lastly, because table schemas were identical between online and offline data stores, schema changes had to be applied in both places, which made maintenance harder.
The team at Yelp decided to address these problems by abstracting away the internal implementation details of the online systems and providing a consistent experience for clients of the analytical data stores. Hakampreet Singh Pandher, senior data engineer at Yelp, explains the approach that the team settled on:
[...] we implemented a unified stream that delivers all relevant business property data in a consistent and user-friendly format. This approach ensures that Business Property consumers are spared from navigating the nuances between Business Attributes and Features or understanding the intricacies of data storage in their respective online source databases.
The team used Apache Beam with Apache Flink as the distributed processing backend. Apache Beam transformation jobs sourced data from the legacy MySQL and newer Cassandra tables, transformed the data into a consistent format, and published it into a single unified stream. Engineers employed a Joinery Flink job to merge business properties data with the corresponding metadata. Another job addressed data inconsistencies, and finally, with the help of the Redshift Connector and the Data Lake Connector, business properties data landed in the two main offline data stores.
New Streaming Architecture for Business Properties (Source: Yelp Engineering Blog)
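To make the normalization step more concrete, the following is a minimal sketch of how records from two differently shaped sources could be mapped to one common format. It is written in plain Python rather than against the Apache Beam SDK, and it is not Yelp's implementation: the `BusinessProperty` type, the field names (`business_id`, `feature`, `value`), and the assumed row layouts are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List


@dataclass(frozen=True)
class BusinessProperty:
    """Hypothetical unified record: one property of one business."""
    business_id: int
    name: str
    value: Any
    source: str


def from_mysql_row(row: Dict[str, Any]) -> List[BusinessProperty]:
    # Assumed legacy layout: one wide row per business, one column per property.
    business_id = row["business_id"]
    return [
        BusinessProperty(business_id, column, value, "mysql")
        for column, value in row.items()
        if column != "business_id" and value is not None
    ]


def from_cassandra_row(row: Dict[str, Any]) -> List[BusinessProperty]:
    # Assumed newer layout: one row per (business, feature) pair.
    return [
        BusinessProperty(row["business_id"], row["feature"], row["value"], "cassandra")
    ]


def unify(
    mysql_rows: Iterable[Dict[str, Any]],
    cassandra_rows: Iterable[Dict[str, Any]],
) -> List[BusinessProperty]:
    # Map both sources to the same record type and emit a single sequence.
    unified: List[BusinessProperty] = []
    for row in mysql_rows:
        unified.extend(from_mysql_row(row))
    for row in cassandra_rows:
        unified.extend(from_cassandra_row(row))
    return unified


if __name__ == "__main__":
    legacy = [{"business_id": 1, "wifi": "free", "parking": None}]
    newer = [{"business_id": 1, "feature": "delivery", "value": True}]
    for record in unify(legacy, newer):
        print(record)
```

In a streaming pipeline, per-row functions like these would typically be applied element by element to each source stream before the results are merged, so consumers only ever see the unified record shape.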
Overall, the overhauled streaming architecture enabled data analytics teams to access business properties data through a single schema, which simplified data discovery and consumption. The team also used the entity-attribute-value (EAV) model so that new business properties could be incorporated into the system with reduced maintenance overhead.
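The reason an EAV model reduces maintenance can be illustrated with a short, generic sketch: each property is stored as an (entity, attribute, value) row, so a new property is just new rows rather than a new column. The helper names `to_eav` and `pivot` and the sample attributes are assumptions made for illustration and do not reflect Yelp's actual schema.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple

# One EAV row: (entity id, attribute name, attribute value).
EavRow = Tuple[int, str, Any]


def to_eav(entity_id: int, properties: Dict[str, Any]) -> List[EavRow]:
    # Flatten a property mapping into one row per attribute.
    return [(entity_id, attribute, value) for attribute, value in properties.items()]


def pivot(rows: Iterable[EavRow]) -> Dict[int, Dict[str, Any]]:
    # Rebuild a per-entity view from EAV rows; later rows overwrite earlier ones.
    entities: Dict[int, Dict[str, Any]] = defaultdict(dict)
    for entity_id, attribute, value in rows:
        entities[entity_id][attribute] = value
    return dict(entities)


if __name__ == "__main__":
    rows = to_eav(1, {"wifi": "free", "delivery": True})
    # A previously unknown property needs no schema change, only another row.
    rows += to_eav(1, {"outdoor_seating": True})
    print(pivot(rows))
```

The trade-off is that consumers who want a wide, one-row-per-business view have to pivot the rows themselves, as `pivot` does here.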