Uncovering a Stealth Performance Bottleneck in ClickHouse: How We Restored Our Billing Pipeline

Introduction: When the Billing Pipeline Slowed to a Crawl

At Cloudflare, millions of ClickHouse queries run every day to determine how much customers are billed for their usage of our products. This pipeline handles hundreds of millions of dollars in revenue, so any delay has serious downstream consequences: reconciling invoices becomes nearly impossible. When our daily aggregation jobs ground to a halt after a routine migration, we knew we had a major problem.

Source: blog.cloudflare.com

We checked all the usual suspects: I/O, memory, rows scanned, and parts read. Nothing appeared abnormal, yet queries remained sluggish. This article tells the story of how we traced the slowdown to a hidden bottleneck deep inside ClickHouse’s internals, and describes the three patches we deployed to fix it.

The Setup: A Petabyte-Scale Analytics Platform

Cloudflare relies heavily on ClickHouse, an open-source OLAP database, to store over a hundred petabytes of data across several clusters. To simplify onboarding for internal teams, we built a system called “Ready-Analytics” in early 2022. Instead of designing custom tables, teams stream data into one massive table. Datasets are differentiated by a namespace, and each record follows a standard schema—typically twenty float fields, twenty string fields, a timestamp, and an indexID.

Sorting is critical for ClickHouse performance. The indexID is a string that forms part of the primary key, allowing each namespace’s data to be sorted optimally for its anticipated queries. The final primary key looks like: (namespace, indexID, timestamp). By December 2024, the system had grown to over 2 PiB of data, ingesting millions of rows per second. Yet it had a critical flaw: its retention policy.
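To make the role of the sort order concrete, here is a minimal Python sketch of why a (namespace, indexID, timestamp) key helps. It is only an analogy: the rows, values, and the scan_prefix helper are hypothetical, and ClickHouse itself uses a sparse index over on-disk granules rather than an in-memory list. The idea it shows is that once rows are sorted by the key, all rows for one namespace and indexID sit in one contiguous range that a binary search can locate.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime

# Hypothetical rows: (namespace, index_id, timestamp, value).
rows = [
    ("dns", "zone-b", datetime(2024, 12, 1, 0, 5), 3.0),
    ("billing", "acct-2", datetime(2024, 12, 1, 0, 1), 7.5),
    ("billing", "acct-1", datetime(2024, 12, 1, 0, 9), 1.0),
    ("dns", "zone-a", datetime(2024, 12, 1, 0, 2), 4.2),
    ("billing", "acct-1", datetime(2024, 12, 1, 0, 3), 2.5),
]


def sort_key(row):
    """Mirror the primary key order: (namespace, indexID, timestamp)."""
    namespace, index_id, timestamp, _ = row
    return (namespace, index_id, timestamp)


sorted_rows = sorted(rows, key=sort_key)
keys = [sort_key(r) for r in sorted_rows]


def scan_prefix(namespace, index_id):
    """Return the contiguous slice of rows matching (namespace, index_id)."""
    # A shorter tuple sorts before any longer tuple sharing its prefix.
    lo = bisect_left(keys, (namespace, index_id))
    hi = bisect_right(keys, (namespace, index_id, datetime.max))
    return sorted_rows[lo:hi]


for row in scan_prefix("billing", "acct-1"):
    print(row)
```

Because the matching rows are adjacent, the lookup touches only that slice instead of filtering the whole dataset.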

The Problem: One Retention Policy to Rule Them All

Cloudflare began using ClickHouse years ago, long before native TTL features existed. We built a custom retention system based on partitions. The Ready-Analytics table was partitioned by day, and a daily job simply dropped partitions older than 31 days.
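The logic of such a job is simple enough to sketch in a few lines of Python. This is an illustration of the idea, not our actual job: the function name and the sample dates are assumptions, and a real job would issue a drop-partition statement against ClickHouse instead of printing.

```python
from datetime import date, timedelta

RETENTION_DAYS = 31  # the single, table-wide retention window


def partitions_to_drop(partitions, today, retention_days=RETENTION_DAYS):
    """Return the daily partitions that are older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(p for p in partitions if p < cutoff)


# Hypothetical example: 35 consecutive daily partitions ending today.
today = date(2024, 12, 31)
partitions = [today - timedelta(days=n) for n in range(35)]

for partition in partitions_to_drop(partitions, today):
    print(f"would drop partition {partition.isoformat()}")
```

Note that the cutoff depends only on the date, never on the namespace, which is exactly why every dataset in the table shared the same retention.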

This “one-size-fits-all” 31‑day retention was a major limitation. Some teams needed to store data for years due to legal or contractual obligations; others required only a few days. The restriction forced those with different needs to abandon Ready-Analytics and adopt a conventional setup, which involved a far more complex onboarding process. Clearly, we needed a new system that allowed per-namespace retention.
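Conceptually, per-namespace retention means the expiry decision takes the namespace as an input. The Python sketch below illustrates that requirement only; the namespace names, retention values, and the is_expired helper are all hypothetical and do not describe how the production system stores or applies its policies.

```python
from datetime import date, timedelta

DEFAULT_RETENTION_DAYS = 31

# Hypothetical per-namespace overrides; names and values are illustrative.
RETENTION_DAYS = {
    "audit-logs": 3 * 365,  # long retention for legal obligations
    "debug-traces": 7,      # short-lived data
}


def is_expired(namespace, record_day, today):
    """Return True if a record from record_day is past its namespace's window."""
    days = RETENTION_DAYS.get(namespace, DEFAULT_RETENTION_DAYS)
    return record_day < today - timedelta(days=days)


today = date(2024, 12, 31)
ten_days_ago = today - timedelta(days=10)
print(is_expired("debug-traces", ten_days_ago, today))
print(is_expired("audit-logs", ten_days_ago, today))
```

The same ten-day-old record is expired in one namespace and retained in another, which a date-only partition drop cannot express.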

The Investigation: When Normal Metrics Deceive

After the migration that introduced per-namespace retention, which we had assumed would be routine, the billing aggregation jobs slowed significantly. We began by examining all the typical performance counters: I/O wait times, memory consumption, rows scanned, and number of parts read. Everything looked normal, with no obvious resource saturation and no unusual query plans.

This was puzzling. Our monitoring, despite its heavy redundancy, had not flagged anything wrong. We then dug deeper into ClickHouse’s query execution engine, tracing individual query steps and analyzing partition pruning behavior. It became clear that the bottleneck was not at the hardware or configuration level. It was buried within the software itself.

Uncovering the Hidden Bottleneck

Through careful profiling and reading ClickHouse’s source code, we discovered that the new retention logic—while correct—introduced an unexpected interaction with how ClickHouse merges parts and evaluates partitions. Specifically, the per-namespace retention required additional metadata lookups for every query, which were not optimized for the high concurrency of the billing pipeline. This lookup overhead became the hidden bottleneck, especially under the heavy load of daily aggregation jobs.

The bottleneck was invisible to standard metrics because it did not manifest as high I/O or CPU—it appeared as increased query latency due to a single-threaded metadata processing step that had not been parallelized.
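A back-of-the-envelope model shows why a serial step can hide this way. The sketch below applies the standard Amdahl's-law reasoning in Python; the timings and core count are made-up numbers for illustration, not measurements from our clusters.

```python
def query_latency(serial_s, parallel_s, threads):
    """Wall-clock estimate: the serial step cannot be split across threads."""
    return serial_s + parallel_s / threads


def busy_share(cores):
    """Fraction of total CPU capacity used while only one thread is running."""
    return 1 / cores


# Hypothetical query: 0.5 s of serial metadata work, 4 s of parallelizable work.
for threads in (1, 8, 64):
    latency = query_latency(0.5, 4.0, threads)
    print(f"{threads:>2} threads: {latency:.4f} s")

# While the serial step runs on a hypothetical 64-core host, CPU looks idle.
print(f"CPU utilization during the serial step: {busy_share(64):.2%}")
```

As the thread count grows, latency approaches the serial time and never drops below it, while the machine as a whole reports almost no CPU load during that step. That combination is what makes this kind of bottleneck easy to miss on a dashboard.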

The Three Patches That Fixed It

Once we identified the root cause, we implemented three targeted patches to resolve the issue:

Optimized metadata caching: We introduced a cache for per-namespace retention metadata to avoid redundant lookups, reducing the overhead per query.
Parallel partition pruning: The metadata processing step was refactored to use multiple threads, significantly cutting down the time spent on partition evaluation for large tables.
Query plan improvements: We adjusted ClickHouse’s query optimizer to prefer partition pruning strategies that minimized the impact of the new retention system on high-frequency billing queries.

These patches were deployed incrementally, and each brought measurable improvements. After the final patch, our daily aggregation jobs returned to normal speed—and the invoices went out on time.

Lessons Learned

This experience reinforced an important principle: performance problems may lurk where traditional monitoring can’t see them. By combining deep knowledge of ClickHouse internals with methodical investigation, we not only fixed the billing pipeline but also contributed improvements upstream. The per-namespace retention system is now live, enabling many more teams to use Ready-Analytics with the flexibility they need.

If you’re running ClickHouse at scale, remember—always look beyond the obvious counters, and don’t be afraid to dive into the source code when nothing else explains the slowdown. The bottleneck might be hidden, but it can be found and fixed.