Allegro experimented with different performance optimization options to improve Apache Kafka producer tail latency and eventually switched all its clusters to the XFS filesystem. The company used Kafka protocol sniffing, JVM profiling, and eBPF, which proved instrumental in identifying and eliminating performance bottlenecks.
Allegro uses Apache Kafka extensively as the backbone for asynchronous communication between microservices in its platform. The company was looking to start a new project where low-latency messaging was a critical requirement. After reviewing Kafka producer latency metrics, engineers discovered that p99 latency was 1 second and p999 latency was up to 3 seconds. Tail latencies as high as 3 seconds were unacceptable for the new functionality, and the team was asked to identify and address the problem.
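To make the p99 and p999 figures concrete, here is a minimal sketch of how tail percentiles can be computed from raw latency samples with the nearest-rank method. The sample data is invented for illustration and is not Allegro's measurements; the helper name `percentile` is also hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical produce latencies in milliseconds: the vast majority of
# requests are fast, while a small share is very slow.
latencies_ms = [5] * 9800 + [1000] * 180 + [3000] * 20

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
p999 = percentile(latencies_ms, 99.9)
print(f"p50={p50} ms, p99={p99} ms, p999={p999} ms")
```

In this made-up data set the median stays at a few milliseconds even though 2% of requests take a second or more, which is why tail percentiles rather than averages are the relevant metric for latency-sensitive workloads.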
Maciej Mościcki and Piotr Rżysko, software engineers at Allegro, considered different approaches toward instrumenting Kafka to find the root cause of high tail latency:
To pinpoint the underlying problem, we decided to trace individual requests. By analyzing components of Kafka involved in handling produce requests, we aimed to uncover the source of the latency spikes. One way of doing that would be to fork Kafka, implement instrumentation, and deploy our custom version to the cluster. However, this would be very time-consuming and invasive. We decided to try an alternative approach.
After analyzing the network traffic on selected Kafka brokers, engineers wanted to understand latencies related to file system operations, so they used the ext4slower tool, which traces slow operations on ext4, the default file system on many Linux distributions. The tool leverages eBPF (extended Berkeley Packet Filter), a kernel technology that can be used to build networking, security, and observability tools for Linux.
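The core idea of threshold-based tracing is to time each file system operation and report only the slow ones. The sketch below is a user-space analogue of that idea, not eBPF and not the tool the engineers used: it times write-plus-fsync calls on a temporary file and keeps those at or above a threshold. The function name and parameters are hypothetical.

```python
import os
import tempfile
import time


def timed_fsync_writes(count, payload, threshold_ms):
    """Write and fsync `count` records to a temporary file and return
    (index, elapsed_ms) for every operation at or above threshold_ms."""
    slow = []
    with tempfile.NamedTemporaryFile() as f:
        for i in range(count):
            start = time.perf_counter()
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # force the data to stable storage
            elapsed_ms = (time.perf_counter() - start) * 1000
            if elapsed_ms >= threshold_ms:
                slow.append((i, elapsed_ms))
    return slow


# Report only operations that took 10 ms or longer.
for index, elapsed in timed_fsync_writes(100, b"x" * 4096, threshold_ms=10.0):
    print(f"write #{index} took {elapsed:.2f} ms")
```

Unlike this sketch, kernel-level tracing observes the real workload without modifying or restarting the application, which is what makes it practical on production brokers.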
Lock Contention Contributing to Slow Writes (Source: Allegro Technology Blog)
The team used async-profiler, eBPF tools, and networking data to attribute some of the high-latency cases to lock contention in the Kafka broker code. Furthermore, the engineers leveraged ebpf_exporter to expose eBPF-based metrics in the Prometheus format so they could be visualized in Grafana.
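For readers unfamiliar with the Prometheus format, the following sketch renders a latency histogram in the Prometheus text exposition format using only the standard library. It illustrates the output format only and says nothing about how the exporter used by the team is implemented or configured; the metric name `ext4_write_latency_seconds`, the bucket bounds, and the observations are all assumptions.

```python
def render_histogram(name, help_text, bounds, observations):
    """Render observations as a Prometheus histogram in text format.
    Buckets are cumulative: each counts observations <= its upper bound."""
    counts = [0] * len(bounds)
    for value in observations:
        for i, bound in enumerate(bounds):
            if value <= bound:
                counts[i] += 1
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} histogram"]
    for bound, count in zip(bounds, counts):
        lines.append(f'{name}_bucket{{le="{bound}"}} {count}')
    # The +Inf bucket always equals the total number of observations.
    lines.append(f'{name}_bucket{{le="+Inf"}} {len(observations)}')
    lines.append(f"{name}_sum {sum(observations)}")
    lines.append(f"{name}_count {len(observations)}")
    return "\n".join(lines) + "\n"


# Hypothetical write latencies in seconds.
text = render_histogram(
    "ext4_write_latency_seconds",
    "Latency of ext4 write operations (illustrative).",
    bounds=[0.005, 0.05, 0.5],
    observations=[0.002, 0.004, 0.03, 0.7],
)
print(text)
```

Because the buckets are cumulative, Grafana can derive heat maps and approximate quantiles from the per-bucket counts over time.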
With more in-depth analysis, journal commits were identified as the main source of latency, and the team went on to explore possible optimizations within the ext4 file system. Switching to writeback journaling mode improved p999 latency to 800 milliseconds, and enabling fast commit, a newer journaling mechanism introduced in Linux 5.10, brought p999 latency down to 500 milliseconds.
While exploring different journaling mechanisms and settings in ext4, the engineers came across suggestions that the XFS file system offers more advanced journaling; XFS is also recommended in the Apache Kafka documentation.
Heat Maps of Latency Outliers For Different Optimizations (Source: Allegro Technology Blog)
The team tested out XFS, and the results indicated much-improved tail latency characteristics. After gaining confidence in performance results, engineers migrated all Kafka brokers to XFS and observed an 82% reduction in Kafka producer latency outliers exceeding the SLO of 65 milliseconds.
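A reduction in SLO outliers can be expressed as the relative change in the share of requests exceeding the threshold. The sketch below shows that arithmetic with the 65-millisecond SLO from the article; the before and after samples are invented for illustration and are not Allegro's data, and the function names are hypothetical.

```python
SLO_MS = 65


def outlier_fraction(latencies_ms, slo_ms=SLO_MS):
    """Share of requests whose latency exceeds the SLO."""
    if not latencies_ms:
        raise ValueError("no samples")
    return sum(1 for v in latencies_ms if v > slo_ms) / len(latencies_ms)


def relative_reduction(before, after):
    """Relative drop from `before` to `after`, e.g. 0.75 means 75% fewer."""
    return (before - after) / before


# Hypothetical samples: 40 of 1000 requests over the SLO before a change,
# 10 of 1000 after it.
before_ms = [10] * 960 + [200] * 40
after_ms = [10] * 990 + [200] * 10

before = outlier_fraction(before_ms)
after = outlier_fraction(after_ms)
print(f"outliers before: {before:.1%}, after: {after:.1%}")
print(f"reduction: {relative_reduction(before, after):.0%}")
```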
Engineers emphasized that analyzing latency metrics alone wasn’t enough, and tracing individual requests was fundamental to determining the root cause of slow writes. Using eBPF and related tooling was essential to capture and expose detailed latency metrics for file system operations.