Reddit's Safety Engineering team recently published how it modernised its Rule-Execution system, which detects and acts on policy-violating content in real time. The new architecture moves from legacy EC2-based systems to Kubernetes, adds rule version control with GitHub and S3 storage, and scales more efficiently with Flink Stateful Functions.
Rule Execution V2 (REV2) is Reddit's modernised rules-engine system to detect and act on policy-violating content. At its core, REV2 uses Lua scripts, termed "rules", that are triggered based on specific events from Kafka, such as a user posting or commenting.
REV2's Architecture (Source)
Flink Stateful Functions and Reddit's Baseplate application framework execute these Lua rules on messages streaming from Apache Kafka topics. Rules are primarily configured through code and stored in GitHub for version control. REV2's CI pipeline then persists these rules to S3, which is checked periodically for rule updates.
The entire system runs on Kubernetes. REV2 sends structured Protobuf actions to specific Kafka topics when an action is determined. Each action type is associated with a unique topic, ensuring tighter schemas and more granular monitoring. A Safety Actioning Worker consumes these topics and carries out the specified action.
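The one-topic-per-action-type design can be illustrated with a minimal Python sketch. The action types, topic names, and the `Action` dataclass (standing in for a Protobuf message) are all hypothetical; the article does not list Reddit's actual names or schemas.

```python
from dataclasses import dataclass

# Hypothetical action types and topic names, for illustration only.
ACTION_TOPICS = {
    "remove_content": "safety.actions.remove_content",
    "ban_user": "safety.actions.ban_user",
}


@dataclass(frozen=True)
class Action:
    """Stand-in for a structured Protobuf action message."""
    action_type: str
    target_id: str


def topic_for(action: Action) -> str:
    """Return the dedicated topic for an action type, rejecting unknown types."""
    try:
        return ACTION_TOPICS[action.action_type]
    except KeyError:
        raise ValueError(
            f"no topic registered for action type {action.action_type!r}"
        ) from None
```

Because every action type maps to exactly one topic, each topic only ever carries one message shape, which is what makes the tighter schemas and per-action monitoring possible.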
REV1's Architecture (Source)
REV1, Reddit's initial rules engine, operated on raw EC2 instances instead of Kubernetes and ran on the deprecated Python 2.7. Each rule in REV1 ran as an independent process, necessitating costly vertical scaling as the safety staff added more rules.
Rule change history was challenging to track since there was no version control and no staging environment for sandbox testing. A web interface allowed editing rules, and when modified, it sent an update to ZooKeeper, which served as the storage for the rules.
Additionally, REV1 wrote actions to a single Kafka topic consumed by Reddit's older monolithic web application, R2. This coupling was a problem because Reddit is deprecating R2 in favour of microservices.
Flink Stateful Functions is a prominent feature used in Reddit's new system. At its core, Flink Stateful Functions enables the separation of an application's streaming layer from its business logic. When the system receives a message through a Kafka ingress, Flink forwards it to a remote service endpoint for processing and outputs a resultant message to a Kafka egress if necessary.
This modular approach allows the streaming tier and the web application to scale independently. Furthermore, it enables the web application to be written in any language as long as it can accommodate requests sent by Flink. In Reddit's REV2, this has notably addressed previous challenges by facilitating swift horizontal scaling during traffic surges.
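The separation of streaming layer and business logic can be sketched in plain Python. This is not the Flink Stateful Functions SDK: `run_stream` merely imitates the streaming tier, and `flag_spam` is a hypothetical rule standing in for the remote service endpoint.

```python
from typing import Callable, Iterable, List, Optional

Event = dict
Handler = Callable[[Event], Optional[dict]]


def run_stream(ingress: Iterable[Event], handler: Handler) -> List[dict]:
    """Streaming tier: forward each event to the handler and collect
    any non-empty result as an egress message."""
    egress = []
    for event in ingress:
        result = handler(event)
        if result is not None:
            egress.append(result)
    return egress


def flag_spam(event: Event) -> Optional[dict]:
    """Business logic: a hypothetical rule that flags one spam phrase."""
    if "buy now" in event.get("body", "").lower():
        return {"action": "remove_content", "target": event["id"]}
    return None
```

The streaming tier knows nothing about the rule beyond its call signature, so either side can be replaced or scaled without touching the other.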
One of the new features introduced with REV2 is Time-Travel, which allows retroactive action on policy-violating content created before a particular rule was written.
When a Safety Operator utilises this feature, they can specify a starting date and time from which a rule begins its execution. This instruction prompts a Flink deployment in the backend to adjust the rule consumer group's offset to the designated startup position. This process creates a large backlog of historical events that the system processes, thus actioning and "cleaning up" the website in retrospect.
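The core of such an offset reset is a lookup from timestamp to offset. Below is a minimal sketch of that lookup, assuming a partition's records are available as `(timestamp_ms, offset)` pairs sorted by timestamp. The function name and data layout are hypothetical, and a real deployment would use Kafka's own timestamp-based offset lookup rather than an in-memory list.

```python
from bisect import bisect_left
from typing import Optional, Sequence, Tuple


def offset_for_start_time(
    log: Sequence[Tuple[int, int]], start_ms: int
) -> Optional[int]:
    """Return the offset of the first record whose timestamp is at or
    after start_ms, or None if every record is older.

    Assumes log is sorted by timestamp (first tuple element).
    """
    timestamps = [ts for ts, _ in log]
    i = bisect_left(timestamps, start_ms)
    return log[i][1] if i < len(log) else None
```

A consumer seeking to the returned offset would then replay every event from the chosen start time onward, which is what produces the backlog described above.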
Reddit's REV2 system sought to optimise its deployment process to ensure faster and more efficient rule updates. Initially, each rule update necessitated both Flink and Baseplate redeployments, a cumbersome and slow process. To address this, Reddit introduced a polling setup using Amazon S3. The system's Continuous Integration (CI) phase uploads a zip file containing all rule configurations during deployment.
S3-based rule-polling architecture (Source)
A Kubernetes sidecar process then periodically checks S3 for any changes to this file. If updates are detected, they are downloaded and unzipped to a shared Kubernetes volume.
The Baseplate application, equipped with file-watchers, can dynamically serve these rule updates without a full redeployment. This S3-based approach has cut deployment time for rule edits by approximately 90%, allowing much faster iteration.
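The sidecar's check-download-unzip cycle might look roughly like the sketch below. It uses only the standard library: the `fetch` callable stands in for the S3 read and returns a version tag together with the zip bytes, and `RulePoller` and `poll_once` are hypothetical names. A real sidecar would likely compare object metadata before downloading and would call `poll_once` on a timer.

```python
import io
import zipfile
from pathlib import Path
from typing import Callable, Optional, Tuple

# fetch() stands in for an S3 read; it returns (version_tag, zip_bytes).
Fetch = Callable[[], Tuple[str, bytes]]


class RulePoller:
    """Unpacks the rules archive into a shared directory when it changes."""

    def __init__(self, fetch: Fetch, target_dir: Path) -> None:
        self.fetch = fetch
        self.target_dir = target_dir
        self.last_version: Optional[str] = None

    def poll_once(self) -> bool:
        """Return True if a new version was unpacked, False if unchanged."""
        version, payload = self.fetch()
        if version == self.last_version:
            return False
        with zipfile.ZipFile(io.BytesIO(payload)) as archive:
            archive.extractall(self.target_dir)
        self.last_version = version
        return True
```

Writing into a directory that the application already watches means a rule edit only has to reach the shared volume, not trigger a redeployment.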