Engineering
October 9, 2024

Empowering Developers through Open Source - Building a Cloud-Native Time-Series Database on the ASF Ecosystem

GreptimeDB is a cloud-native, high-performance time-series database built on the ASF open-source ecosystem, aimed at providing users with a flexible solution for managing time-series data. This article will explore its deep integration with open-source projects and its core features.

Unified Time-Series Data Handling

When managing data, we prioritize not just traditional Metrics but also the integration of timestamps and contextual information. GreptimeDB can manage and process various time-series data, including Events, Logs, and Traces. This unified approach enables seamless integration and analysis of diverse data, allowing comprehensive system monitoring and maximizing data value.

sql
SELECT
    time,
    host,
    approx_percentile_cont(latency, 0.95) RANGE '15s' as p95_latency,
    count(error) RANGE '15s' as num_errors
FROM
    metrics INNER JOIN logs on metrics.host = logs.host
WHERE
    time > now() - INTERVAL '1 hour' AND
    matches(path, '/api/v1/avatar')
ALIGN '5s' BY (host) FILL PREV

Start with Familiar Protocols

GreptimeDB supports multiple protocols for writing and querying time-series data. Whether you prefer SQL, Prometheus, InfluxDB, or OpenTSDB, GreptimeDB ensures a smooth transition in data writing processes. It offers both SQL and PromQL (Prometheus Query Language) for querying data, providing powerful and flexible data retrieval capabilities that help you quickly manage time-series data without extra learning costs.
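
As one illustration of writing through a familiar protocol, the sketch below builds a single record in InfluxDB line protocol, the text format that InfluxDB clients send. It only produces the payload string; the measurement, tag, and field names are hypothetical, and the endpoint you would post it to is not covered in this article, so it is left out.

```python
def escape_tag(text):
    """Escape commas, equals signs, and spaces in tag keys, tag values, and field keys."""
    return text.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def format_field(value):
    """Render a field value using line protocol type markers."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'  # strings are double-quoted


def to_line(measurement, tags, fields, ts_ns):
    """Build one record: measurement,tags fields timestamp (nanoseconds)."""
    name = measurement.replace(",", "\\,").replace(" ", "\\ ")
    tag_part = "".join(
        f",{escape_tag(key)}={escape_tag(value)}" for key, value in sorted(tags.items())
    )
    field_part = ",".join(
        f"{escape_tag(key)}={format_field(value)}" for key, value in fields.items()
    )
    return f"{name}{tag_part} {field_part} {ts_ns}"


# Hypothetical sample record.
line = to_line("cpu", {"host": "web 1"}, {"usage": 0.5, "cores": 4}, 1700000000000000000)
print(line)
```

An existing InfluxDB client library would normally do this formatting for you; the point is that the payload stays the same when the write target changes.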

Ubiquitous Time-Series Data

One of GreptimeDB's strengths is its flexible deployment capabilities. It can operate in the cloud or at the edge, leveraging edge computing advantages. By processing data at the source, GreptimeDB can significantly reduce data transmission, saving up to 97% in bandwidth. This design eliminates lengthy data pipelines, directly enhancing system throughput.

Cloud-Native & Scalable

Figure 1: Shared Storage Architecture vs. Shared-Nothing Architecture

GreptimeDB's architecture fully utilizes cloud-native advantages, employing a shared storage model with S3 as a storage medium. Compared to EBS (gp3), S3 offers a cost reduction of 75% while providing up to 200 Gbps throughput on a single compute resource. This architecture lowers storage costs and performs exceptionally well in large-scale data processing. Additionally, it dramatically improves the efficiency of partition migrations, reducing migration time from hours to seconds or even milliseconds, ensuring zero downtime and business continuity.

Modular Design

Building a database system from scratch is complex, involving numerous critical components such as Catalogs, SQL parsers, data type systems, WAL, storage, optimizers, and query engines. Each component requires substantial resources and time to develop.

Figure 2: Evolution of the compiler

However, the trend in software engineering is moving towards modular design. The evolution of compilers, exemplified by LLVM, has enabled languages like Swift and Rust to share backends, accelerating technology iteration and sharing. Similarly, a modular component design in database systems can significantly speed up development and enhance flexibility.

Figure 3: Evolution of the database

Building Time-Series Databases on Apache Open Source Projects

In constructing our time-series database system, we leverage several key Apache open-source projects:

  • Apache Arrow: Provides an efficient in-memory columnar format for fast random access and in-memory data processing.
  • Apache DataFusion: A fast, embeddable, and scalable query engine offering SQL and DataFrame APIs, utilizing Apache Arrow as its memory model for efficient data processing.
  • Apache OpenDAL: Offers a unified data access layer that simplifies the integration and management of various data sources.
  • Apache Parquet: A columnar storage format optimized for storing and reading large-scale datasets.
  • Apache Kafka (optional): Serves as a Remote WAL, supporting region migration.
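
To make the idea of a columnar in-memory layout concrete, here is a minimal Python sketch that turns row-oriented records into per-column arrays and aggregates a single column. It is an illustration only: it does not use Apache Arrow or reproduce its memory format, and the field names (ts, host, latency) are hypothetical.

```python
from array import array

# Row-oriented records, as an application might produce them (hypothetical data).
rows = [
    {"ts": 1, "host": "a", "latency": 10.0},
    {"ts": 2, "host": "b", "latency": 20.0},
    {"ts": 3, "host": "a", "latency": 30.0},
]


def to_columns(records):
    """Store each field contiguously so a scan touches only the columns it needs."""
    return {
        "ts": array("q", (r["ts"] for r in records)),            # 64-bit integers
        "host": [r["host"] for r in records],                    # strings stay in a list
        "latency": array("d", (r["latency"] for r in records)),  # 64-bit floats
    }


def mean(column):
    """Aggregate one column without reading any other field."""
    return sum(column) / len(column)


columns = to_columns(rows)
print(mean(columns["latency"]))
```

Computing the average latency reads one contiguous array instead of visiting every record, which is the property that makes columnar formats attractive for analytical queries.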

Extending SQL

In time-series data processing, querying and aggregating data over specific time ranges is often necessary. However, traditional SQL has limitations in natively supporting time-series queries. To address this, GreptimeDB introduces extended SQL syntax that combines the flexibility of SQL with enhanced native time-series support.

sql
SELECT
    ts,
    avg(temp) RANGE '1d' FILL LINEAR
FROM
    temperature
WHERE
    city = 'beijing' AND ts < 1682985600000
ALIGN '1d';

In GreptimeDB, the ALIGN keyword in a SELECT statement sets the step size for time-series queries, aligning time slots with the calendar. The RANGE keyword specifies the aggregation window, while FILL LINEAR fills missing data points by linear interpolation between neighboring values. These extensions make time-series queries more flexible and efficient.
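
The behavior of ALIGN, RANGE, and FILL LINEAR can be modeled in a few lines of Python. This is a simplified sketch, not GreptimeDB's implementation: it assumes the range equals the step, that each slot covers the half-open window [t, t + step), and that FILL LINEAR means linear interpolation between the nearest non-empty slots. The function names are hypothetical.

```python
from collections import defaultdict


def range_avg(points, step):
    """Average (ts, value) points per aligned slot; empty slots are None."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % step].append(value)  # align ts down to a multiple of step
    if not buckets:
        return []
    slots = range(min(buckets), max(buckets) + step, step)
    return [
        (slot, sum(buckets[slot]) / len(buckets[slot]) if slot in buckets else None)
        for slot in slots
    ]


def fill_linear(series):
    """Fill None values by interpolating between the surrounding known slots."""
    filled = list(series)
    known = [i for i, (_, value) in enumerate(series) if value is not None]
    for left, right in zip(known, known[1:]):
        (t0, v0), (t1, v1) = series[left], series[right]
        for i in range(left + 1, right):
            t = series[i][0]
            filled[i] = (t, v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return filled


# Two readings land in slot 0, none in slots 10 and 20, one in slot 30.
series = range_avg([(0, 5.0), (5, 15.0), (30, 40.0)], step=10)
print(fill_linear(series))
```

In this sketch, gaps before the first or after the last non-empty slot stay empty, because there is no second point to interpolate toward.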

Supporting PromQL

Figure 4: Rust promql-parser

To fit into the Prometheus ecosystem, GreptimeDB implements comprehensive PromQL support. With compatibility reaching up to 82%, it is currently one of the most compatible independent third-party implementations. We integrated PromQL as a new dialect into the DataFusion execution engine, enabling GreptimeDB to better meet the demands of users within the Prometheus ecosystem.

Easy Multi-Cloud Support

GreptimeDB utilizes Apache OpenDAL as its data access layer, providing a unified API to connect to most existing storage services (supporting a wide range of object storage services such as AWS S3, Google Cloud Storage, Alibaba Cloud OSS). Through open-source collaboration, we have worked with the community to optimize OpenDAL’s read-write performance, ensuring that it can fully utilize the bandwidth of object storage services and enhance data processing efficiency.
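
The value of a unified data access layer is that calling code is written once against a single interface and the backend is swapped underneath it. The Python sketch below shows that pattern with two toy backends. It is not OpenDAL's API: ObjectStore, MemoryStore, DirectoryStore, and round_trip are hypothetical names introduced for illustration.

```python
import tempfile
from pathlib import Path
from typing import Protocol


class ObjectStore(Protocol):
    """The only surface the calling code depends on."""

    def write(self, key: str, data: bytes) -> None: ...

    def read(self, key: str) -> bytes: ...


class MemoryStore:
    """Keeps objects in a dict; handy for tests."""

    def __init__(self):
        self._objects = {}

    def write(self, key, data):
        self._objects[key] = bytes(data)

    def read(self, key):
        return self._objects[key]


class DirectoryStore:
    """Maps keys to files under a root directory."""

    def __init__(self, root):
        self._root = Path(root)

    def write(self, key, data):
        path = self._root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def read(self, key):
        return (self._root / key).read_bytes()


def round_trip(store: ObjectStore, key: str, payload: bytes) -> bytes:
    """Caller logic that never needs to know which backend it is talking to."""
    store.write(key, payload)
    return store.read(key)


print(round_trip(MemoryStore(), "region-1/data.bin", b"abc"))
print(round_trip(DirectoryStore(tempfile.mkdtemp()), "region-1/data.bin", b"abc"))
```

Adding another backend means adding one more class with the same two methods; round_trip and everything built on it stay unchanged.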

Figure 5: Storage services supported by OpenDAL

Thriving with Open Source

Figure 6: orc-rust crate

We are committed to growing with the open-source community. GreptimeDB’s core developer Ruihang Xia has become a PMC member of the Apache DataFusion project, actively participating in and promoting community development. Additionally, we have contributed the datafusion-orc library to the datafusion-contrib organization and plan to donate it to Apache soon as the Rust implementation of ORC (datafusion-orc#120). These efforts contribute not only to the advancement of GreptimeDB but also to the flourishing of the open-source ecosystem.

Implement Once, Benefit All

Figure 7: Concurrent Write Benchmark results, opendal#3942

One of the great things about the open-source community is that improvements made upstream benefit all downstream projects. Inspired by academic research, we contributed a concurrent upload feature to OpenDAL, achieving linear growth in write performance. This not only boosts GreptimeDB’s performance but also benefits other projects relying on OpenDAL, truly reflecting the “Implement once, benefit all” philosophy.
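
The general idea behind concurrent upload is to split one large object into parts and send the parts in parallel rather than one after another. The Python sketch below shows that idea with a thread pool and an in-memory stand-in for a bucket. It is not the code contributed in opendal#3942: FakeBucket, split_parts, and concurrent_upload are hypothetical, and a real object store would also need a final step to complete the multipart upload.

```python
from concurrent.futures import ThreadPoolExecutor


def split_parts(data, part_size):
    """Cut the payload into fixed-size parts; the last one may be shorter."""
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]


def concurrent_upload(data, part_size, upload_part, max_workers=4):
    """Upload all parts in parallel and return one receipt per part, in part order."""
    parts = split_parts(data, part_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in submission order, so receipts match part numbers.
        return list(pool.map(upload_part, range(len(parts)), parts))


class FakeBucket:
    """In-memory stand-in for an object store that accepts numbered parts."""

    def __init__(self):
        self.parts = {}

    def upload_part(self, number, chunk):
        self.parts[number] = chunk
        return number

    def assemble(self):
        return b"".join(self.parts[n] for n in sorted(self.parts))


bucket = FakeBucket()
receipts = concurrent_upload(b"0123456789", part_size=3, upload_part=bucket.upload_part)
print(receipts, bucket.assemble())
```

Because each part carries its own number, parts may arrive in any order and still be reassembled correctly, which is what allows the uploads to overlap.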

May Open Source Be With You

In today’s open-source community, projects are interconnected and contribute to each other, forming a symbiotic relationship where one enhances the other. Several open-source projects contributed by different communities have been widely adopted across various ecosystems, embodying the collaborative spirit and shared values of the open-source world. For example:

  • GreptimeDB contributed the datafusion-orc and pgwire projects. datafusion-orc has been adopted by both GreptimeDB and Databend, while pgwire is used by GreptimeDB and CeresDB/HoraeDB.
  • Databend spearheaded the OpenDAL and opensrv-mysql projects. OpenDAL has gained wide adoption across several communities, including Vector, ParadeDB, QuestDB, RisingWave, and GreptimeDB, while opensrv-mysql is also used by GreptimeDB.
  • RisingWave contributed the arrow-udf project, which is utilized by RisingWave and has also been adopted by Databend.
  • CeresDB/HoraeDB led the sqlness project, which is also used by GreptimeDB.
  • DataFusion, an Apache project to which InfluxDB is a major contributor, has been widely adopted by multiple communities, including GreptimeDB and CeresDB/HoraeDB.

Conclusion

In this fast-evolving technological era, building and managing a time-series database is not just a technical challenge but a collaborative journey with global developers. Through its flexible architecture, strong protocol support, and deep engagement with the open-source community, GreptimeDB showcases the endless possibilities of modern database systems. Whether in the cloud or at the edge, for data storage or query optimization, we remain committed to modularity and open source, pushing technology forward alongside developers worldwide.

References

[1] S3 throughput: Dominik Durner, Viktor Leis, and Thomas Neumann. 2023. Exploiting Cloud Object Storage for High-Performance Analytics. Proc. VLDB Endow. 16, 11 (July 2023), 2769–2782. https://doi.org/10.14778/3611479.3611486

[2] EBS throughput: https://docs.aws.amazon.com/ebs/latest/userguide/ebs-volume-types.html

[3] IBM System/390: https://en.wikipedia.org/wiki/IBM_System/390

[4] Solaris: https://en.wikipedia.org/wiki/Oracle_Solaris

[5] LLVM: https://llvm.org

[6] Swift: https://www.swift.org

[7] Rust: https://www.rust-lang.org

[8] Oracle: https://www.oracle.com/database/

[9] DB2: https://www.ibm.com/db2

[10] InfluxDB: https://www.influxdata.com

[11] GreptimeDB: https://greptime.com

[12] sunng87/pgwire: https://github.com/sunng87/pgwire

[13] datafuselabs/opensrv: https://github.com/datafuselabs/opensrv

[14] risingwavelabs/arrow-udf: https://github.com/risingwavelabs/arrow-udf

[15] ceresdb/sqlness: https://github.com/CeresDB/sqlness

[16] apache/arrow: https://arrow.apache.org/

[17] apache/datafusion: https://datafusion.apache.org/

[18] apache/opendal: https://opendal.apache.org/

[19] apache/parquet: https://parquet.apache.org/

[20] apache/kafka: https://kafka.apache.org/

[21] datafusion-contrib/datafusion-orc: https://github.com/datafusion-contrib/datafusion-orc

[22] GreptimeTeam/promql-parser: https://github.com/GreptimeTeam/promql-parser

[23] datafusion-contrib/datafusion-orc#120: https://github.com/datafusion-contrib/datafusion-orc/issues/120

[24] apache/opendal#3942: https://github.com/apache/opendal/pull/3942


About Greptime

We help industries that generate large amounts of time-series data, such as Connected Vehicles (CV), IoT, and Observability, to efficiently uncover the hidden value of data in real-time.

Visit the latest version from any device to get started and get the most out of your data.

  • GreptimeDB, written in Rust, is a distributed, open-source, time-series database designed for scalability, efficiency, and powerful analytics.
  • Edge-Cloud Integrated TSDB is designed for the unique demands of edge storage and compute in IoT. It tackles the exponential growth of edge data by integrating a multimodal edge-side database with cloud-based GreptimeDB Enterprise. This combination reduces traffic, computing, and storage costs while enhancing data timeliness and business insights.
  • GreptimeCloud is a fully-managed cloud database-as-a-service (DBaaS) solution built on GreptimeDB. It efficiently supports applications in fields such as observability, IoT, and finance.

Star us on GitHub or join GreptimeDB Community on Slack to get connected. Also, you can go to our contribution page to find some interesting issues to start with.
