When time series databases face high-cardinality workloads—millions of unique series from fields like request IDs or user tokens—the traditional series-oriented storage layout breaks down. Memory bloats, scans degrade to near row-store performance, and dedup costs explode.
GreptimeDB 1.0 beta introduces flat format: a new storage layout with a redesigned memtable (BulkMemtable) and multi-series merge path. In high-cardinality workloads, we see 4× better write throughput and up to 10× faster queries, while maintaining comparable performance in traditional scenarios.
This article explains the motivation behind flat format, how it works internally, and when you should use it.
Background
GreptimeDB organizes data using an LSM-Tree[1] structure. Incoming writes append to the WAL[2] and go into an in-memory memtable. Once the memtable crosses a size threshold, GreptimeDB flushes it to Apache Parquet[3] files and truncates the WAL. Data is automatically partitioned by time—typically one partition per day—with each partition maintaining its own memtables and Parquet files.
+------------------+ +------------------+ +------------------+
| Time Partition 1 | | Time Partition 2 | | Time Partition 3 |
| (Day 1) | | (Day 2) | | (Day 3) |
| +──────────────+ | | +──────────────+ | | +──────────────+ |
┌─────>| | Memtable | | ┌──>| | Memtable | | ┌──>| | Memtable | |
│ | +──────────────+ | │ | +──────────────+ | │ | +──────────────+ |
│ | | | │ | | | │ | | |
│ | v flush | │ | v flush | │ | v flush |
│ | +──────────────+ | │ | +──────────────+ | │ | +──────────────+ |
│ | | Parquet Files| | │ | | Parquet Files| | │ | | Parquet Files| |
│ | | ┌─────┬────┐ | | │ | | ┌─────┬────┐ | | │ | | ┌─────┐ | |
│ | | │SST 1│SST2│ | | │ | | │SST 1│SST2│ | | │ | | │SST 1│ | |
│ | | └─────┴────┘ | | │ | | └─────┴────┘ | | │ | | └─────┘ | |
│ | | ┌─────┐ | | │ | | ┌─────┐ | | │ | | | |
│ | | │SST 3│ | | │ | | │SST 3│ | | │ | | | |
│ | | └─────┘ | | │ | | └─────┘ | | │ | | | |
│ | +──────────────+ | │ | +──────────────+ | │ | +──────────────+ |
│ +------------------+ │ +------------------+ │ +------------------+
│ │ │
+─┴────────────────────────────┴─────────────────────────┴─+
| WAL |
+──────────────────────────────────────────────────────────+
^
│
Data
Users define a primary key (PK) when creating a table, which determines the table's time series[4]. Each unique PK value corresponds to one series. Within memtables and Parquet files, rows belonging to the same series are stored contiguously. By default, GreptimeDB uses last-write-wins semantics for rows with identical PK and timestamp; users can enable append mode to keep all such rows.
In the primary-key format—the original layout (SQL parameter value: 'primary_key')—GreptimeDB processes data on a per-series basis:
- The memtable allocates a separate buffer for each series (sketched after this list)
- The scan path is optimized around single-series reads
- All tag columns (columns in the PK) are encoded into a single binary column for efficient comparison
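Conceptually, the memtable in this format behaves like a map from the encoded primary key to a per-series buffer. The sketch below is a deliberately simplified illustration in Rust; the names and fields are ours, not GreptimeDB's actual memtable types, but it shows where the per-series overhead comes from:

```rust
use std::collections::HashMap;

// Simplified, hypothetical per-series layout (not GreptimeDB's real types).
struct SeriesBuffer {
    timestamps: Vec<i64>,
    values: Vec<f64>, // one vector per field column in a real table
}

#[derive(Default)]
struct PrimaryKeyMemtable {
    // One buffer per unique encoded primary key. Each entry carries fixed overhead
    // (map slot plus small vectors), which is negligible with thousands of series
    // but dominates memory when there are millions of series with few rows each.
    series: HashMap<Vec<u8>, SeriesBuffer>,
}

impl PrimaryKeyMemtable {
    fn write(&mut self, encoded_pk: Vec<u8>, ts: i64, value: f64) {
        let buf = self
            .series
            .entry(encoded_pk)
            .or_insert_with(|| SeriesBuffer { timestamps: Vec::new(), values: Vec::new() });
        buf.timestamps.push(ts);
        buf.values.push(value);
    }
}
```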
This design works well when a time partition contains up to a few hundred thousand series. But when users include high-cardinality columns in the PK—request IDs, trace IDs, or user tokens—things break down:
- Maintaining millions of per-series buffers in the memtable wastes memory
- As series count grows and per-series data shrinks, scan efficiency degrades toward row-store levels
- With dedup enabled, merge and dedup costs grow roughly linearly with series count
Flat format addresses these issues with a new storage layout and query path that handles high-cardinality workloads efficiently while avoiding regression in traditional scenarios.
Flat Format
Compared to the primary-key format, flat format introduces three key changes:
- A new Parquet layout that stores tag columns individually
- A new BulkMemtable that doesn't maintain per-series structures
- A multi-series merge and dedup path that reduces overhead at scale
Storage Layout
Consider a table with this schema:
CREATE TABLE IF NOT EXISTS `cpu` (
`hostname` STRING NULL,
`region` STRING NULL,
`datacenter` STRING NULL,
`team` STRING NULL,
`usage_user` BIGINT NULL,
`usage_system` BIGINT NULL,
`usage_idle` BIGINT NULL,
`greptime_timestamp` TIMESTAMP(9) NOT NULL,
TIME INDEX (`greptime_timestamp`),
PRIMARY KEY (`hostname`, `region`, `datacenter`, `team`)
);
In the primary-key format, Parquet files look like this:
┌────────────┬──────────────┬────────────┬────────────────────┬───────────────┬────────────┬───────────┐
│ usage_user │ usage_system │ usage_idle │ greptime_timestamp │ __primary_key │ __sequence │ __op_type │
├────────────┼──────────────┼────────────┼────────────────────┼───────────────┼────────────┼───────────┤
│ 10 │ 5 │ 85 │ 2024-01-01 00:00 │ key0 │ 1 │ PUT │
│ 12 │ 6 │ 82 │ 2024-01-01 00:01 │ key0 │ 2 │ PUT │
│ 15 │ 8 │ 77 │ 2024-01-01 00:02 │ key0 │ 3 │ PUT │
│ 20 │ 10 │ 70 │ 2024-01-01 00:00 │ key1 │ 1 │ PUT │
│ 22 │ 11 │ 67 │ 2024-01-01 00:01 │ key1 │ 2 │ PUT │
└────────────┴──────────────┴────────────┴────────────────────┴───────────────┴────────────┴───────────┘
All tag columns (hostname, region, datacenter, team) are packed into a single binary column __primary_key. Data is sorted by PK so rows from the same series appear together.
This simplifies per-series processing and saves space. But it has downsides:
- Packed tags prevent Parquet's per-column statistics from being used for predicate pushdown—only the first tag's stats help with filtering
- Reading one tag requires decoding the entire PK
- Third-party tools can't easily analyze the Parquet files
Flat format keeps the benefits of series-aware sorting while making tags first-class columns:
┌──────────┬────────┬────────────┬──────┬────────────┬──────────────┬────────────┬────────────────────┬───────────────┬────────────┬───────────┐
│ hostname │ region │ datacenter │ team │ usage_user │ usage_system │ usage_idle │ greptime_timestamp │ __primary_key │ __sequence │ __op_type │
├──────────┼────────┼────────────┼──────┼────────────┼──────────────┼────────────┼────────────────────┼───────────────┼────────────┼───────────┤
│ host1 │ cn │ dc1 │ t1 │ 10 │ 5 │ 85 │ 2024-01-01 00:00 │ key0 │ 1 │ PUT │
│ host1 │ cn │ dc1 │ t1 │ 12 │ 6 │ 82 │ 2024-01-01 00:01 │ key0 │ 2 │ PUT │
│ host1 │ cn │ dc1 │ t1 │ 15 │ 8 │ 77 │ 2024-01-01 00:02 │ key0 │ 3 │ PUT │
│ host2 │ us │ dc2 │ t2 │ 20 │ 10 │ 70 │ 2024-01-01 00:00 │ key1 │ 1 │ PUT │
│ host2 │ us │ dc2 │ t2 │ 22 │ 11 │ 67 │ 2024-01-01 00:01 │ key1 │ 2 │ PUT │
└──────────┴────────┴────────────┴──────┴────────────┴──────────────┴────────────┴────────────────────┴───────────────┴────────────┴───────────┘
Storing tags as individual columns enables:
- Per-column statistics for predicate pushdown on any tag, not just the first one (sketched after this list)
- Independent filtering on each tag column
- Reading specific tags without decoding the full PK
- Easy analysis with standard Parquet tools
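To make the pushdown benefit concrete, here is a minimal Rust sketch of row-group pruning with per-column min/max statistics. The types and values are invented for illustration and are not Parquet's or GreptimeDB's actual metadata APIs. With tags stored individually, a filter such as datacenter = 'dc8' can skip row groups using the datacenter column's own statistics; with the packed layout, the __primary_key statistics only reflect the leading tag.

```rust
// Hypothetical per-row-group statistics for one tag column; values are invented.
struct ColumnStats<'a> {
    min: &'a str,
    max: &'a str,
}

/// A row group may contain matches for `column = value` only if the value falls
/// inside the column's [min, max] range.
fn may_contain(stats: &ColumnStats, value: &str) -> bool {
    stats.min <= value && value <= stats.max
}

fn main() {
    // Statistics for the `datacenter` column in two row groups.
    let row_groups = [
        ColumnStats { min: "dc1", max: "dc3" },
        ColumnStats { min: "dc7", max: "dc9" },
    ];
    // A filter like `datacenter = 'dc8'` only needs to scan the second row group.
    let to_scan: Vec<usize> = row_groups
        .iter()
        .enumerate()
        .filter(|(_, stats)| may_contain(stats, "dc8"))
        .map(|(i, _)| i)
        .collect();
    println!("row groups to scan: {to_scan:?}"); // prints [1]
}
```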
The encoded __primary_key column is retained for efficient sorting, merging, and dedup. Although this introduces some redundancy, Parquet's columnar storage and dictionary encoding keep file sizes nearly identical—within 1-2% in TSBS benchmarks.
Since flat format files are essentially the primary-key format plus additional columns, backward compatibility is straightforward.
BulkMemtable
At high series counts, allocating a separate buffer for each series becomes inefficient—each series incurs fixed memory overhead regardless of how much data it holds. BulkMemtable borrows from LSM-Tree designs by treating incoming data as independent parts rather than maintaining per-series structures.
Incoming writes are converted into sorted BulkPart structures and appended to a part list.
Each BulkPart stores data in an Apache Arrow[5] RecordBatch[6] using the flat schema. Rows are sorted by (primary_key, timestamp, sequence DESC).
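As a rough illustration, the sketch below captures that ordering invariant in plain Rust. Treat the types as hypothetical: the real BulkPart holds an Arrow RecordBatch in the flat schema plus per-part metadata, not a vector of row structs.

```rust
// Hypothetical, simplified view of a BulkPart and its sort order.
#[derive(Clone)]
struct Row {
    primary_key: Vec<u8>, // encoded __primary_key
    timestamp: i64,
    sequence: u64,
    // field columns omitted
}

struct BulkPart {
    rows: Vec<Row>, // sorted by (primary_key, timestamp, sequence DESC)
    min_timestamp: i64,
    max_timestamp: i64,
}

impl BulkPart {
    /// Builds one sorted part from an unsorted write batch.
    fn from_batch(mut rows: Vec<Row>) -> Self {
        rows.sort_unstable_by(|a, b| {
            a.primary_key
                .cmp(&b.primary_key)
                .then(a.timestamp.cmp(&b.timestamp))
                .then(b.sequence.cmp(&a.sequence)) // newest write first within a timestamp
        });
        let min_timestamp = rows.iter().map(|r| r.timestamp).min().unwrap_or(i64::MAX);
        let max_timestamp = rows.iter().map(|r| r.timestamp).max().unwrap_or(i64::MIN);
        Self { rows, min_timestamp, max_timestamp }
    }
}
```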
+-----------------------------------------------------------------------------------------------+
| BulkPart |
+-----------------------------------------------------------------------------------------------+
| min_timestamp: 2024-01-01 00:00 |
| max_timestamp: 2024-01-01 00:02 |
| sequence: 3 |
+-----------------------------------------------------------------------------------------------+
| RecordBatch (Arrow) |
| ┌──────────┬────────┬────────────┬─────┬───────────────┬────────────┬───────────┐ |
| │ hostname │ region │ usage_user │ ... │ __primary_key │ __sequence │ __op_type │ |
| ├──────────┼────────┼────────────┼─────┼───────────────┼────────────┼───────────┤ |
| │ host1 │ cn │ 10 │ ... │ key0 │ 1 │ PUT │ |
| │ host1 │ cn │ 12 │ ... │ key0 │ 2 │ PUT │ |
| │ host2 │ us │ 20 │ ... │ key1 │ 1 │ PUT │ |
| │ host2 │ us │ 22 │ ... │ key1 │ 2 │ PUT │ |
| └──────────┴────────┴────────────┴─────┴───────────────┴────────────┴───────────┘ |
| ↑ |
| Sorted by (primary key, timestamp, seq desc) |
+-----------------------------------------------------------------------------------------------+
Parts are organized by size:
+------------------+
| BulkMemtable |
+------------------+
|
v
+------------------+
| BulkParts |
+------------------+
| |
| +-------------+ |
| |UnorderedPart|--|--> small parts
| +-------------+ |
| | |
| v |
| +-------------+ |
| | parts |--|--> large parts
| | (Vec) | |
| +-------------+ |
| | |
| v |
| +-------------+ |
| |encoded_parts|--|--> parts encoded in Apache Parquet format
| | (Vec) | |
| +-------------+ |
+------------------+
UnorderedPart optimizes small-batch writes. When an incoming batch has fewer than 1,024 rows (the default threshold), the BulkPart is cached in UnorderedPart. Once the cache hits 4,096 rows, it's merge-sorted into a single BulkPart. This avoids the overhead of repeatedly merging many tiny parts. These thresholds can be tuned for specific workloads.
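A rough sketch of this staging policy, reusing the hypothetical BulkPart type from the earlier sketch. The constants mirror the defaults mentioned above but are assumptions here, and a real implementation would merge the already-sorted staged parts rather than re-sorting them:

```rust
// Hypothetical staging policy for small write batches.
const SMALL_BATCH_ROWS: usize = 1_024;
const MERGE_THRESHOLD_ROWS: usize = 4_096;

#[derive(Default)]
struct BulkParts {
    staged: Vec<BulkPart>, // the UnorderedPart: small batches waiting to be combined
    staged_rows: usize,
    parts: Vec<BulkPart>,  // larger sorted parts
}

impl BulkParts {
    fn push(&mut self, part: BulkPart) {
        if part.rows.len() < SMALL_BATCH_ROWS {
            self.staged_rows += part.rows.len();
            self.staged.push(part);
            if self.staged_rows >= MERGE_THRESHOLD_ROWS {
                // Combine all staged small parts into one larger sorted part.
                let rows: Vec<Row> = self.staged.drain(..).flat_map(|p| p.rows).collect();
                self.staged_rows = 0;
                self.parts.push(BulkPart::from_batch(rows));
            }
        } else {
            // Large batches skip the staging area entirely.
            self.parts.push(part);
        }
    }
}
```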
Large BulkParts go directly into the parts list. When the list grows too long, a merge is triggered. The merge sorts parts by row count and combines similarly-sized parts, keeping data sorted. If dedup is enabled (append mode disabled), duplicates are removed during merging.
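The merge-with-dedup step can be pictured with the same hypothetical Row type: because each part is already ordered by (primary_key, timestamp, sequence DESC), keeping only the first row of each (primary_key, timestamp) group preserves the newest write. This is a simplified sketch; the actual code streams a k-way merge over sorted Arrow batches instead of materializing vectors:

```rust
// Sketch of merging two sorted runs with optional last-write-wins dedup.
fn merge_dedup(a: Vec<Row>, b: Vec<Row>, dedup: bool) -> Vec<Row> {
    let mut merged: Vec<Row> = a.into_iter().chain(b).collect();
    merged.sort_unstable_by(|x, y| {
        x.primary_key
            .cmp(&y.primary_key)
            .then(x.timestamp.cmp(&y.timestamp))
            .then(y.sequence.cmp(&x.sequence))
    });
    if !dedup {
        return merged; // append mode keeps every row
    }
    let mut out: Vec<Row> = Vec::with_capacity(merged.len());
    for row in merged {
        let duplicate = out
            .last()
            .map_or(false, |prev| prev.primary_key == row.primary_key && prev.timestamp == row.timestamp);
        if !duplicate {
            out.push(row); // first row of each (pk, ts) group is the newest write
        }
    }
    out
}
```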
After merging, BulkParts are encoded into EncodedBulkParts—in-memory Parquet files. Parquet encoding and compression further reduce memory footprint.
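For a feel of what "in-memory Parquet" means, the snippet below encodes an Arrow RecordBatch into a Vec<u8> with the arrow and parquet crates. It only illustrates the technique; it is not GreptimeDB's EncodedBulkPart code, and the two-column schema is made up for the example:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, StringArray};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

// Encode a RecordBatch into an in-memory Parquet buffer. Default writer properties
// already apply dictionary encoding and compression, which is what shrinks the
// memtable's footprint.
fn encode_in_memory(batch: &RecordBatch) -> parquet::errors::Result<Vec<u8>> {
    let mut buffer = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buffer, batch.schema(), None)?;
    writer.write(batch)?;
    writer.close()?;
    Ok(buffer)
}

fn main() -> parquet::errors::Result<()> {
    // A tiny, made-up batch standing in for a sorted BulkPart.
    let hostname: ArrayRef = Arc::new(StringArray::from(vec!["host1", "host1", "host2"]));
    let usage_user: ArrayRef = Arc::new(Int64Array::from(vec![10, 12, 20]));
    let batch = RecordBatch::try_from_iter(vec![("hostname", hostname), ("usage_user", usage_user)])
        .expect("valid batch");
    let bytes = encode_in_memory(&batch)?;
    println!("encoded {} rows into {} bytes of Parquet", batch.num_rows(), bytes.len());
    Ok(())
}
```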
Bottom line: larger write batches mean better throughput and less internal merge overhead with BulkMemtable.
Performance
We benchmarked flat format against the primary-key format using TSBS[7], an industry-standard time-series benchmark suite.
Test environment
| Component | Specification |
|---|---|
| CPU | AMD Ryzen 7 7735HS, 16C @ 3.2GHz |
| Memory | 32GB |
| Disk | SOLIDIGM SSDPFKNU010TZ |
| OS | Ubuntu 22.04.2 LTS |
Write throughput at 4K series
At low cardinality, flat format shows some write performance regression compared to primary-key format, though larger batch sizes mitigate this:
| Client Configuration | primary-key (rows/s) | flat (rows/s) |
|---|---|---|
| 6 workers, batch 3,000 | 401,416 | 371,473 |
| 6 workers, batch 20,000 | 402,473 | 407,408 |
| 6 workers, batch 30,000 | 412,436 | 405,963 |
| 12 workers, batch 20,000 | 441,445 | 478,442 |
Write throughput at 2M series
At high cardinality, flat format pulls far ahead:
| Client Configuration | primary-key (rows/s) | flat (rows/s) |
|---|---|---|
| 6 workers, batch 3,000 | 87,141 | 363,741 (4.2×) |
Query latency at 2M series
Several queries show dramatic improvements:
| Query | primary-key (ms) | flat (ms) | Speedup |
|---|---|---|---|
| cpu-max-all-1 | 530.62 | 364.69 | 1.5× |
| cpu-max-all-8 | 17,317.18 | 3,522.58 | 4.9× |
| high-cpu-1 | 542.29 | 291.12 | 1.9× |
| single-groupby-1-1-1 | 124.65 | 60.52 | 2.1× |
| single-groupby-1-1-12 | 408.15 | 437.35 | 0.9× |
| single-groupby-1-8-1 | 3,554.64 | 430.95 | 8.2× |
| single-groupby-5-1-1 | 147.15 | 67.84 | 2.2× |
| single-groupby-5-1-12 | 492.27 | 476.50 | 1.0× |
| single-groupby-5-8-1 | 4,722.53 | 467.78 | 10.1× |
Usage and Best Practices
If your workload includes high-cardinality columns in the primary key, flat format is worth trying. In one real case, a user hit a write bottleneck with primary-key format—despite moderate traffic, each flush processed over 14 million series, making writes unusable. Switching to flat format resolved the issue.
Option 1: Create a new table with flat format
Specify sst_format='flat' in the CREATE TABLE statement[8]:
CREATE TABLE `http_logs` (
`access_time` TIMESTAMP TIME INDEX,
`application` STRING,
`remote_addr` STRING,
`http_status` STRING,
`http_method` STRING,
`http_refer` STRING,
`user_agent` STRING,
`request_id` STRING,
`request` STRING,
PRIMARY KEY(`application`, `request_id`)
) with ('append_mode'='true', 'sst_format'='flat');
For example, an http_logs table where queries filter by application and request_id can safely include request_id in the PK. With the primary-key format, this might cause OOM; flat format handles it fine.
Option 2: Convert an existing table
Use ALTER TABLE[9]:
ALTER TABLE cpu SET 'sst_format' = 'flat';
Note: Once converted to flat format, a table cannot be reverted to primary-key format, as the primary-key format cannot read data files generated by flat format.
Best practices
- Under a million series? No need to migrate—primary-key format works well
- Using flat format? Update to the latest GreptimeDB version
- Maximize throughput: Use batch sizes of 10,000+ rows
- Dedup overhead is unavoidable: Even with flat format, dedup costs scale with data volume. Append-only tables still deliver the best query performance
Since flat format is a superset of the primary-key format, we plan to make it the default over time. We're continuing to test and optimize the flat format paths—feedback and benchmarks from real workloads are welcome.