When time series databases face high-cardinality workloads—millions of unique series from fields like request IDs or user tokens—the traditional series-oriented storage layout breaks down. Memory bloats, scans degrade to near row-store performance, and dedup costs explode.
GreptimeDB 1.0 beta introduces flat format: a new storage layout with a redesigned memtable (BulkMemtable) and multi-series merge path. In high-cardinality workloads, we see 4× better write throughput and up to 10× faster queries, while maintaining comparable performance in traditional scenarios.
This article explains the motivation behind flat format, how it works internally, and when you should use it.
Background
GreptimeDB organizes data using an LSM-Tree[1] structure. Incoming writes append to the WAL[2] and go into an in-memory memtable. Once the memtable crosses a size threshold, GreptimeDB flushes it to Apache Parquet[3] files and truncates the WAL. Data is automatically partitioned by time—typically one partition per day—with each partition maintaining its own memtables and Parquet files.
+------------------+ +------------------+ +------------------+
| Time Partition 1 | | Time Partition 2 | | Time Partition 3 |
| (Day 1) | | (Day 2) | | (Day 3) |
| +──────────────+ | | +──────────────+ | | +──────────────+ |
┌─────>| | Memtable | | ┌──>| | Memtable | | ┌──>| | Memtable | |
│ | +──────────────+ | │ | +──────────────+ | │ | +──────────────+ |
│ | | | │ | | | │ | | |
│ | v flush | │ | v flush | │ | v flush |
│ | +──────────────+ | │ | +──────────────+ | │ | +──────────────+ |
│ | | Parquet Files| | │ | | Parquet Files| | │ | | Parquet Files| |
│ | | ┌─────┬────┐ | | │ | | ┌─────┬────┐ | | │ | | ┌─────┐ | |
│ | | │SST 1│SST2│ | | │ | | │SST 1│SST2│ | | │ | | │SST 1│ | |
│ | | └─────┴────┘ | | │ | | └─────┴────┘ | | │ | | └─────┘ | |
│ | | ┌─────┐ | | │ | | ┌─────┐ | | │ | | | |
│ | | │SST 3│ | | │ | | │SST 3│ | | │ | | | |
│ | | └─────┘ | | │ | | └─────┘ | | │ | | | |
│ | +──────────────+ | │ | +──────────────+ | │ | +──────────────+ |
│ +------------------+ │ +------------------+ │ +------------------+
│ │ │
+─┴────────────────────────────┴─────────────────────────┴─+
| WAL |
+──────────────────────────────────────────────────────────+
^
│
Data
Users define a primary key (PK) when creating a table, which determines the table's time series[4]. Each unique PK value corresponds to one series. Within memtables and Parquet files, rows belonging to the same series are stored contiguously. By default, GreptimeDB uses last-write-wins semantics for rows with identical PK and timestamp; users can enable append mode to keep all such rows.
In the primary-key format—the original layout (SQL parameter value: 'primary_key')—GreptimeDB processes data on a per-series basis:
- The memtable allocates a separate buffer for each series (sketched after this list)
- The scan path is optimized around single-series reads
- All tag columns (columns in the PK) are encoded into a single binary column for efficient comparison
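Conceptually, the memtable in this format behaves like a map from the encoded primary key to a per-series buffer. The sketch below is a deliberately simplified illustration in Rust; the names and fields are ours, not GreptimeDB's actual memtable types, but it shows where the per-series overhead comes from:

```rust
use std::collections::HashMap;

// Simplified, hypothetical per-series layout (not GreptimeDB's real types).
struct SeriesBuffer {
    timestamps: Vec<i64>,
    values: Vec<f64>, // one vector per field column in a real table
}

#[derive(Default)]
struct PrimaryKeyMemtable {
    // One buffer per unique encoded primary key. Each entry carries fixed overhead
    // (map slot plus small vectors), which is negligible with thousands of series
    // but dominates memory when there are millions of series with few rows each.
    series: HashMap<Vec<u8>, SeriesBuffer>,
}

impl PrimaryKeyMemtable {
    fn write(&mut self, encoded_pk: Vec<u8>, ts: i64, value: f64) {
        let buf = self
            .series
            .entry(encoded_pk)
            .or_insert_with(|| SeriesBuffer { timestamps: Vec::new(), values: Vec::new() });
        buf.timestamps.push(ts);
        buf.values.push(value);
    }
}
```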
This design works well when a time partition contains up to a few hundred thousand series. But when users include high-cardinality columns in the PK—request IDs, trace IDs, or user tokens—things break down:
- Maintaining millions of per-series buffers in the memtable wastes memory
- As series count grows and per-series data shrinks, scan efficiency degrades toward row-store levels
- With dedup enabled, merge and dedup costs grow roughly linearly with series count
Flat format addresses these issues with a new storage layout and query path that handles high-cardinality workloads efficiently while avoiding regression in traditional scenarios.
Flat Format
Compared to the primary-key format, flat format introduces three key changes:
- A new Parquet layout that stores tag columns individually
- A new BulkMemtable that doesn't maintain per-series structures
- A multi-series merge and dedup path that reduces overhead at scale
Storage Layout
Consider a table with this schema:
CREATE TABLE IF NOT EXISTS `cpu` (
`hostname` STRING NULL,
`region` STRING NULL,
`datacenter` STRING NULL,
`team` STRING NULL,
`usage_user` BIGINT NULL,
`usage_system` BIGINT NULL,
`usage_idle` BIGINT NULL,
`greptime_timestamp` TIMESTAMP(9) NOT NULL,
TIME INDEX (`greptime_timestamp`),
PRIMARY KEY (`hostname`, `region`, `datacenter`, `team`)
);
In the primary-key format, Parquet files look like this:
┌────────────┬──────────────┬────────────┬────────────────────┬───────────────┬────────────┬───────────┐
│ usage_user │ usage_system │ usage_idle │ greptime_timestamp │ __primary_key │ __sequence │ __op_type │
├────────────┼──────────────┼────────────┼────────────────────┼───────────────┼────────────┼───────────┤
│ 10 │ 5 │ 85 │ 2024-01-01 00:00 │ key0 │ 1 │ PUT │
│ 12 │ 6 │ 82 │ 2024-01-01 00:01 │ key0 │ 2 │ PUT │
│ 15 │ 8 │ 77 │ 2024-01-01 00:02 │ key0 │ 3 │ PUT │
│ 20 │ 10 │ 70 │ 2024-01-01 00:00 │ key1 │ 1 │ PUT │
│ 22 │ 11 │ 67 │ 2024-01-01 00:01 │ key1 │ 2 │ PUT │
└────────────┴──────────────┴────────────┴────────────────────┴───────────────┴────────────┴───────────┘
All tag columns (hostname, region, datacenter, team) are packed into a single binary column __primary_key. Data is sorted by PK so rows from the same series appear together.
This simplifies per-series processing and saves space. But it has downsides:
- Packed tags prevent Parquet's per-column statistics from being used for predicate pushdown—only the first tag's stats help with filtering
- Reading one tag requires decoding the entire PK
- Third-party tools can't easily analyze the Parquet files
Flat format keeps the benefits of series-aware sorting while making tags first-class columns:
┌──────────┬────────┬────────────┬──────┬────────────┬──────────────┬────────────┬────────────────────┬───────────────┬────────────┬───────────┐
│ hostname │ region │ datacenter │ team │ usage_user │ usage_system │ usage_idle │ greptime_timestamp │ __primary_key │ __sequence │ __op_type │
├──────────┼────────┼────────────┼──────┼────────────┼──────────────┼────────────┼────────────────────┼───────────────┼────────────┼───────────┤
│ host1 │ cn │ dc1 │ t1 │ 10 │ 5 │ 85 │ 2024-01-01 00:00 │ key0 │ 1 │ PUT │
│ host1 │ cn │ dc1 │ t1 │ 12 │ 6 │ 82 │ 2024-01-01 00:01 │ key0 │ 2 │ PUT │
│ host1 │ cn │ dc1 │ t1 │ 15 │ 8 │ 77 │ 2024-01-01 00:02 │ key0 │ 3 │ PUT │
│ host2 │ us │ dc2 │ t2 │ 20 │ 10 │ 70 │ 2024-01-01 00:00 │ key1 │ 1 │ PUT │
│ host2 │ us │ dc2 │ t2 │ 22 │ 11 │ 67 │ 2024-01-01 00:01 │ key1 │ 2 │ PUT │
└──────────┴────────┴────────────┴──────┴────────────┴──────────────┴────────────┴────────────────────┴───────────────┴────────────┴───────────┘
Storing tags as individual columns enables:
- Per-column statistics for predicate pushdown on any tag, not just the first one (sketched after this list)
- Independent filtering on each tag column
- Reading specific tags without decoding the full PK
- Easy analysis with standard Parquet tools
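To make the pushdown benefit concrete, here is a minimal Rust sketch of row-group pruning with per-column min/max statistics. The types and values are invented for illustration and are not Parquet's or GreptimeDB's actual metadata APIs. With tags stored individually, a filter such as datacenter = 'dc8' can skip row groups using the datacenter column's own statistics; with the packed layout, the __primary_key statistics only reflect the leading tag.

```rust
// Hypothetical per-row-group statistics for one tag column; values are invented.
struct ColumnStats<'a> {
    min: &'a str,
    max: &'a str,
}

/// A row group may contain matches for `column = value` only if the value falls
/// inside the column's [min, max] range.
fn may_contain(stats: &ColumnStats, value: &str) -> bool {
    stats.min <= value && value <= stats.max
}

fn main() {
    // Statistics for the `datacenter` column in two row groups.
    let row_groups = [
        ColumnStats { min: "dc1", max: "dc3" },
        ColumnStats { min: "dc7", max: "dc9" },
    ];
    // A filter like `datacenter = 'dc8'` only needs to scan the second row group.
    let to_scan: Vec<usize> = row_groups
        .iter()
        .enumerate()
        .filter(|(_, stats)| may_contain(stats, "dc8"))
        .map(|(i, _)| i)
        .collect();
    println!("row groups to scan: {to_scan:?}"); // prints [1]
}
```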
The encoded __primary_key column is retained for efficient sorting, merging, and dedup. Although this introduces some redundancy, Parquet's columnar storage and dictionary encoding keep file sizes nearly identical—within 1-2% in TSBS benchmarks.
Since flat format files are essentially the primary-key format plus additional columns, backward compatibility is straightforward.
BulkMemtable
At high series counts, allocating a separate buffer for each series becomes inefficient—each series incurs fixed memory overhead regardless of how much data it holds. BulkMemtable borrows from LSM-Tree designs by treating incoming data as independent parts rather than maintaining per-series structures.
Incoming writes are converted into sorted BulkPart structures and appended to a part list.
Each BulkPart stores data in an Apache Arrow[5] RecordBatch[6] using the flat schema. Rows are sorted by (primary_key, timestamp, sequence DESC).
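As a rough illustration, the sketch below captures that ordering invariant in plain Rust. Treat the types as hypothetical: the real BulkPart holds an Arrow RecordBatch in the flat schema plus per-part metadata, not a vector of row structs.

```rust
// Hypothetical, simplified view of a BulkPart and its sort order.
#[derive(Clone)]
struct Row {
    primary_key: Vec<u8>, // encoded __primary_key
    timestamp: i64,
    sequence: u64,
    // field columns omitted
}

struct BulkPart {
    rows: Vec<Row>, // sorted by (primary_key, timestamp, sequence DESC)
    min_timestamp: i64,
    max_timestamp: i64,
}

impl BulkPart {
    /// Builds one sorted part from an unsorted write batch.
    fn from_batch(mut rows: Vec<Row>) -> Self {
        rows.sort_unstable_by(|a, b| {
            a.primary_key
                .cmp(&b.primary_key)
                .then(a.timestamp.cmp(&b.timestamp))
                .then(b.sequence.cmp(&a.sequence)) // newest write first within a timestamp
        });
        let min_timestamp = rows.iter().map(|r| r.timestamp).min().unwrap_or(i64::MAX);
        let max_timestamp = rows.iter().map(|r| r.timestamp).max().unwrap_or(i64::MIN);
        Self { rows, min_timestamp, max_timestamp }
    }
}
```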
+-----------------------------------------------------------------------------------------------+
| BulkPart |
+-----------------------------------------------------------------------------------------------+
| min_timestamp: 2024-01-01 00:00 |
| max_timestamp: 2024-01-01 00:02 |
| sequence: 3 |
+-----------------------------------------------------------------------------------------------+
| RecordBatch (Arrow) |
| ┌──────────┬────────┬────────────┬─────┬───────────────┬────────────┬───────────┐ |
| │ hostname │ region │ usage_user │ ... │ __primary_key │ __sequence │ __op_type │ |
| ├──────────┼────────┼────────────┼─────┼───────────────┼────────────┼───────────┤ |
| │ host1 │ cn │ 10 │ ... │ key0 │ 1 │ PUT │ |
| │ host1 │ cn │ 12 │ ... │ key0 │ 2 │ PUT │ |
| │ host2 │ us │ 20 │ ... │ key1 │ 1 │ PUT │ |
| │ host2 │ us │ 22 │ ... │ key1 │ 2 │ PUT │ |
| └──────────┴────────┴────────────┴─────┴───────────────┴────────────┴───────────┘ |
| ↑ |
| Sorted by (primary key, timestamp, seq desc) |
+-----------------------------------------------------------------------------------------------+
Parts are organized by size:
+------------------+
| BulkMemtable |
+------------------+
|
v
+------------------+
| BulkParts |
+------------------+
| |
| +-------------+ |
| |UnorderedPart|--|--> small parts
| +-------------+ |
| | |
| v |
| +-------------+ |
| | parts |--|--> large parts
| | (Vec) | |
| +-------------+ |
| | |
| v |
| +-------------+ |
| |encoded_parts|--|--> parts encoded in Apache Parquet format
| | (Vec) | |
| +-------------+ |
+------------------+
UnorderedPart optimizes small-batch writes. When an incoming batch has fewer than 1,024 rows (the default threshold), the BulkPart is cached in UnorderedPart. Once the cache hits 4,096 rows, it's merge-sorted into a single BulkPart. This avoids the overhead of repeatedly merging many tiny parts. These thresholds can be tuned for specific workloads.
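A rough sketch of this staging policy, reusing the hypothetical BulkPart type from the earlier sketch. The constants mirror the defaults mentioned above but are assumptions here, and a real implementation would merge the already-sorted staged parts rather than re-sorting them:

```rust
// Hypothetical staging policy for small write batches.
const SMALL_BATCH_ROWS: usize = 1_024;
const MERGE_THRESHOLD_ROWS: usize = 4_096;

#[derive(Default)]
struct BulkParts {
    staged: Vec<BulkPart>, // the UnorderedPart: small batches waiting to be combined
    staged_rows: usize,
    parts: Vec<BulkPart>,  // larger sorted parts
}

impl BulkParts {
    fn push(&mut self, part: BulkPart) {
        if part.rows.len() < SMALL_BATCH_ROWS {
            self.staged_rows += part.rows.len();
            self.staged.push(part);
            if self.staged_rows >= MERGE_THRESHOLD_ROWS {
                // Combine all staged small parts into one larger sorted part.
                let rows: Vec<Row> = self.staged.drain(..).flat_map(|p| p.rows).collect();
                self.staged_rows = 0;
                self.parts.push(BulkPart::from_batch(rows));
            }
        } else {
            // Large batches skip the staging area entirely.
            self.parts.push(part);
        }
    }
}
```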
Large BulkParts go directly into the parts list. When the list grows too long, a merge is triggered. The merge sorts parts by row count and combines similarly-sized parts, keeping data sorted. If dedup is enabled (append mode disabled), duplicates are removed during merging.
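The merge-with-dedup step can be pictured with the same hypothetical Row type: because each part is already ordered by (primary_key, timestamp, sequence DESC), keeping only the first row of each (primary_key, timestamp) group preserves the newest write. This is a simplified sketch; the actual code streams a k-way merge over sorted Arrow batches instead of materializing vectors:

```rust
// Sketch of merging two sorted runs with optional last-write-wins dedup.
fn merge_dedup(a: Vec<Row>, b: Vec<Row>, dedup: bool) -> Vec<Row> {
    let mut merged: Vec<Row> = a.into_iter().chain(b).collect();
    merged.sort_unstable_by(|x, y| {
        x.primary_key
            .cmp(&y.primary_key)
            .then(x.timestamp.cmp(&y.timestamp))
            .then(y.sequence.cmp(&x.sequence))
    });
    if !dedup {
        return merged; // append mode keeps every row
    }
    let mut out: Vec<Row> = Vec::with_capacity(merged.len());
    for row in merged {
        let duplicate = out
            .last()
            .map_or(false, |prev| prev.primary_key == row.primary_key && prev.timestamp == row.timestamp);
        if !duplicate {
            out.push(row); // first row of each (pk, ts) group is the newest write
        }
    }
    out
}
```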
After merging, BulkParts are encoded into EncodedBulkParts—in-memory Parquet files. Parquet encoding and compression further reduce memory footprint.
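For a feel of what "in-memory Parquet" means, the snippet below encodes an Arrow RecordBatch into a Vec<u8> with the arrow and parquet crates. It only illustrates the technique; it is not GreptimeDB's EncodedBulkPart code, and the two-column schema is made up for the example:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, StringArray};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

// Encode a RecordBatch into an in-memory Parquet buffer. Default writer properties
// already apply dictionary encoding and compression, which is what shrinks the
// memtable's footprint.
fn encode_in_memory(batch: &RecordBatch) -> parquet::errors::Result<Vec<u8>> {
    let mut buffer = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buffer, batch.schema(), None)?;
    writer.write(batch)?;
    writer.close()?;
    Ok(buffer)
}

fn main() -> parquet::errors::Result<()> {
    // A tiny, made-up batch standing in for a sorted BulkPart.
    let hostname: ArrayRef = Arc::new(StringArray::from(vec!["host1", "host1", "host2"]));
    let usage_user: ArrayRef = Arc::new(Int64Array::from(vec![10, 12, 20]));
    let batch = RecordBatch::try_from_iter(vec![("hostname", hostname), ("usage_user", usage_user)])
        .expect("valid batch");
    let bytes = encode_in_memory(&batch)?;
    println!("encoded {} rows into {} bytes of Parquet", batch.num_rows(), bytes.len());
    Ok(())
}
```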
Bottom line: larger write batches mean better throughput and less internal merge overhead with BulkMemtable.
Performance
We benchmarked flat format against the primary-key format using TSBS[7], an industry-standard time-series benchmark suite.
Test environment
| Component | Specification |
|---|---|
| CPU | AMD Ryzen 7 7735HS, 16C @ 3.2GHz |
| Memory | 32GB |
| Disk | SOLIDIGM SSDPFKNU010TZ |
| OS | Ubuntu 22.04.2 LTS |
Write throughput at 4K series
At low cardinality, flat format shows some write performance regression compared to primary-key format, though larger batch sizes mitigate this:
| Client Configuration | primary-key (rows/s) | flat (rows/s) |
|---|---|---|
| 6 workers, batch 3,000 | 401,416 | 371,473 |
| 6 workers, batch 20,000 | 402,473 | 407,408 |
| 6 workers, batch 30,000 | 412,436 | 405,963 |
| 12 workers, batch 20,000 | 441,445 | 478,442 |
Write throughput at 2M series
At high cardinality, flat format pulls far ahead:
| Client Configuration | primary-key (rows/s) | flat (rows/s) |
|---|---|---|
| 6 workers, batch 3,000 | 87,141 | 363,741 (4.2×) |
Query latency at 2M series
Several queries show dramatic improvements:
| Query | primary-key (ms) | flat (ms) | Speedup |
|---|---|---|---|
| cpu-max-all-1 | 530.62 | 364.69 | 1.5× |
| cpu-max-all-8 | 17,317.18 | 3,522.58 | 4.9× |
| high-cpu-1 | 542.29 | 291.12 | 1.9× |
| single-groupby-1-1-1 | 124.65 | 60.52 | 2.1× |
| single-groupby-1-1-12 | 408.15 | 437.35 | 0.9× |
| single-groupby-1-8-1 | 3,554.64 | 430.95 | 8.2× |
| single-groupby-5-1-1 | 147.15 | 67.84 | 2.2× |
| single-groupby-5-1-12 | 492.27 | 476.50 | 1.0× |
| single-groupby-5-8-1 | 4,722.53 | 467.78 | 10.1× |
Usage and Best Practices
If your workload includes high-cardinality columns in the primary key, flat format is worth trying. In one real case, a user hit a write bottleneck with primary-key format—despite moderate traffic, each flush processed over 14 million series, making writes unusable. Switching to flat format resolved the issue.
Option 1: Create a new table with flat format
Specify sst_format='flat' in the CREATE TABLE statement[8]:
CREATE TABLE `http_logs` (
`access_time` TIMESTAMP TIME INDEX,
`application` STRING,
`remote_addr` STRING,
`http_status` STRING,
`http_method` STRING,
`http_refer` STRING,
`user_agent` STRING,
`request_id` STRING,
`request` STRING,
PRIMARY KEY(`application`, `request_id`)
) with ('append_mode'='true', 'sst_format'='flat');
For example, an http_logs table where queries filter by application and request_id can safely include request_id in the PK. With the primary-key format, this might cause OOM; flat format handles it fine.
Option 2: Convert an existing table
Use ALTER TABLE[9]:
ALTER TABLE cpu SET 'sst_format' = 'flat';
Note: Once converted to flat format, a table cannot be reverted to primary-key format, as the primary-key format cannot read data files generated by flat format.
Best practices
- Under a million series? No need to migrate—primary-key format works well
- Using flat format? Update to the latest GreptimeDB version
- Maximize throughput: Use batch sizes of 10,000+ rows
- Dedup overhead is unavoidable: Even with flat format, dedup costs scale with data volume. Append-only tables still deliver the best query performance
Since flat format is a superset of the primary-key format, we plan to make it the default over time. We're continuing to test and optimize the flat format paths—feedback and benchmarks from real workloads are welcome.