2.71 GiB → 166 MiB: How One Column Was Eating 94% of Our Storage
170 million rows. 2.71 GiB on disk. We dropped one column and changed some types. Same data, same queries — 166 MiB. Here's what we learned about ClickHouse compression the hard way.
We load-tested our analytics pipeline with 170 million events. Three months of simulated click data. The numbers came back:
Compressed (on disk): 2.71 GiB
Uncompressed: 16.69 GiB
Compression ratio: 6.16x
Per-row (on disk): 16.8 bytes
A 6x compression ratio. Decent—but ClickHouse is built for columnar compression. Well-tuned analytics tables commonly hit 15-30x. We knew we could do a lot better.
So we looked at per-column storage. And one column made us stare at the screen for a while.
94% of Our Disk Was One Column
Every row had a UUID. A random, unique identifier generated for every event. Standard practice—every table needs a primary key, right?
We checked per-column disk usage:
The UUID column: 2.55 GiB → 94% of total disk
Everything else: ~160 MiB → 6%
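That breakdown comes straight from system.columns. A query along these lines reproduces it (the table name events is a stand-in for our actual table):

```sql
-- Per-column on-disk footprint for one table, heaviest first.
SELECT
    name,
    formatReadableSize(sum(data_compressed_bytes))   AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.columns
WHERE database = currentDatabase() AND table = 'events'
GROUP BY name
ORDER BY sum(data_compressed_bytes) DESC;
```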
Read that again. One column consumed 94% of our storage. Every other column in the table—all of them combined—compressed down to roughly 160 MiB. The UUID was 16x larger than everything else put together.
Why? Because UUIDs are random. Every value is unique. Every byte is entropy. Compression algorithms work by finding patterns, and random data has none. LZ4 looked at 170 million unique 16-byte values and essentially gave up.
170,000,000 × 16 bytes ≈ 2.5 GiB of incompressible noise
Meanwhile, the columns we actually queried—the ones powering our dashboards and aggregations—were compressing beautifully. Some were achieving ratios we didn't think were possible. Hundreds-to-one. The compressor was doing incredible work on everything except the random UUID.
And the UUID? We never queried it. Not once. Not in any dashboard. Not in any API endpoint. Not in any aggregation. We had other columns handling ordering and lookups. The UUID was a habit from relational database thinking, carried over into a columnar world where it became our single biggest cost.
We Dropped It
One migration. One DROP COLUMN. The results were immediate.
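In ClickHouse that migration is a one-liner (events and event_id are stand-in names, not our real schema):

```sql
-- Drop the unused UUID column; ClickHouse removes its data files in the background.
ALTER TABLE events DROP COLUMN event_id;
```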
The Type Overhaul
With the UUID gone, we looked at what was left. Most columns were String type. ClickHouse was already compressing them well, but we knew we could do better by being specific about what our data actually looked like.
We went column by column:
- Columns with a few hundred unique values across 170 million rows? Wrapped them in LowCardinality. Instead of storing the full string per row, ClickHouse builds a dictionary and stores a small integer index (a single byte when there are at most 255 distinct values). The savings are absurd when you have 200 unique values repeated across 170 million rows.
- Columns with a tiny fixed set of possible values? Replaced with Enum8—one byte per row, with schema-level validation that rejects bad data at insert time.
- Columns with fixed-length values? Switched to FixedString(N). No length-prefix overhead, and the uniform layout compresses better than variable-length data.
- Structured data stored as strings? Replaced with native types: fixed-width binary representations that are smaller and denser.
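Put together, the rewrites look roughly like this. The column names and enum values below are illustrative, not our real schema:

```sql
ALTER TABLE events
    MODIFY COLUMN browser    LowCardinality(String),  -- few hundred distinct values: dictionary-encoded
    MODIFY COLUMN event_type Enum8('click' = 1, 'view' = 2, 'scan' = 3),  -- tiny fixed set: 1 byte per row
    MODIFY COLUMN country    FixedString(2),          -- fixed-length ISO code: no length prefix
    MODIFY COLUMN client_ip  IPv4;                    -- native type instead of a dotted-quad String
```

Each of these is a rewriting mutation on an existing table; on a fresh table you would simply declare the narrower types up front.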
These changes were boring. No algorithmic breakthroughs. Just telling ClickHouse what we already knew about our data.
The Numbers
Same 170 million rows. Same data. Same queries. Same dashboards.
| Metric | Before | After | Change |
|---|---|---|---|
| Compressed (disk) | 2.71 GiB | 166 MiB | -93.9% |
| Uncompressed | 16.69 GiB | 14.14 GiB | -15.2% |
| Compression ratio | 6.16x | 88.6x | 14x better |
| Per-row (disk) | 16.8 bytes | 1.02 bytes | -94% |
From 6x compression to 89x compression. From 16.8 bytes per row to just over 1 byte per row on disk.
The entire analytics table—170 million click events with full metadata—now fits in 166 MiB. That's smaller than what the UUID column alone used to cost.
One Byte Per Row Changes Everything
At ~1 byte per row compressed, the scaling math gets fun:
| Rows | Disk |
|---|---|
| 170M (current) | 166 MiB |
| 500M | ~490 MiB |
| 1 billion | ~1 GiB |
| 10 billion | ~10 GiB |
A billion rows in a gigabyte. Ten billion in ten. We went from worrying about disk provisioning to realizing storage is no longer a meaningful cost in our infrastructure.
It's Faster Too
Less data on disk means less data to read. We ran some queries on the optimized table to see what that looks like in practice:
| Query | Rows Scanned | Time |
|---|---|---|
| Full-column scan, 1M rows | 1,000,000 | 47ms |
| 30-day time-series | 170,500,000 | 284ms |
| Full-table multi-column aggregation | 170,500,000 | 1.76s |
A 30-day time-series across all 170 million rows in 284 milliseconds. A full-table multi-column aggregation in under 2 seconds. A million-row scan in 47ms.
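In sketch form, that 30-day time-series is a plain GROUP BY over a date function (table and column names here are stand-ins):

```sql
-- Daily event counts for the last 30 days.
SELECT
    toStartOfDay(created_at) AS day,
    count() AS events
FROM events
WHERE created_at >= now() - INTERVAL 30 DAY
GROUP BY day
ORDER BY day;
```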
Before the optimization, every one of those queries was dragging 2.55 GiB of random UUID data through memory. Now the entire table on disk is smaller than that one column used to be.
Dictionary-encoded columns also get compared using integer operations instead of string matching. Less data to read and cheaper comparisons per row. The wins compound.
The Lesson
We almost didn't look. The table was working. Queries were fast enough. Disk wasn't full yet. We would have happily run at 6x compression for months without questioning it.
The fix took one afternoon:
- Drop the column nobody queried. A random UUID in a columnar analytics database is almost pure waste. If you're not filtering, grouping, or selecting by it—it's just dead weight that every scan has to skip over. In our case, it was 94% of our storage cost.
- Use specific types instead of String. LowCardinality for columns with few unique values. Enum8 for tiny fixed sets. FixedString for fixed-length data. Native types for structured data. Don't make the compressor guess what you already know.
The uncompressed data only dropped 15%—from 16.69 GiB to 14.14 GiB. The real magic was in compression. By removing incompressible data and giving ClickHouse better type hints, the ratio jumped from 6x to 89x.
Your table probably has a column like our UUID. Something that seemed reasonable, that nobody questions, that's quietly eating your disk budget. Go check system.columns and sort by data_compressed_bytes. You might be surprised what you find.
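The check is a single query. This variant lists the ten heaviest columns across every table in the current database, so no table names need to be assumed:

```sql
-- Top 10 columns by compressed size, across all tables in the current database.
SELECT
    table,
    name,
    formatReadableSize(sum(data_compressed_bytes)) AS on_disk
FROM system.columns
WHERE database = currentDatabase()
GROUP BY table, name
ORDER BY sum(data_compressed_bytes) DESC
LIMIT 10;
```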
QCK is a URL shortener with built-in real-time analytics. Try it free — 25 links, full analytics, no credit card.