Events
December 19, 2024

Greptime Engineer Ruihang Xia Presented at CMUDB Seminar — Apache DataFusion Database Practices

GreptimeDB engineer Ruihang Xia was invited to speak in the CMUDB Fall 2024 Seminar series, Database Building Blocks. His talk, Implement, Integrate, and Extend a Query Engine, showed how GreptimeDB uses DataFusion as a framework for implementing custom functionality and optimizations that improve the database’s query performance and flexibility.

In the digital age, database technology is not just the backbone of information management but also a critical factor in shaping enterprise data architectures and optimizing performance. To explore the latest trends and challenges in the field of databases, the Database Research Group at Carnegie Mellon University (CMU) hosted the Fall 2024 Seminar series, Database Building Blocks: Foundations of Modern Database Systems.

The series, organized by Andy Pavlo, Jignesh Patel, and Sam Arch, began on September 23 and covered a variety of technical topics, including Apache Arrow DataFusion, Apache Spark, Postgres, bioinformatics databases, and OpenDAL. It invited experts in the database field to share their research findings and technical challenges.

Greptime Engineer was invited to share database insights

GreptimeDB’s Senior R&D Engineer, Ruihang Xia, was invited to participate in the series. His talk, Implement, Integrate, and Extend a Query Engine, gave an in-depth analysis of the Apache DataFusion project and demonstrated how GreptimeDB uses DataFusion as a framework for implementing custom functionality and optimizations that improve the database’s query performance and flexibility.

GreptimeDB is built on Apache DataFusion along with several other widely used modules. The presentation focused on how to manage queries across different components of a time-series database. Ruihang shared the process of extending DataFusion to support PromQL, enhancing SQL syntax, integrating with external secondary indexes, and implementing domain-specific optimization rules. Each of these features contributes to optimizing query execution at various stages.

Ruihang also explored how to use DataFusion and Apache Arrow to build and optimize custom query plans, highlighting how these frameworks can improve database flexibility and performance to meet different application scenarios.

Event Information

🎤 Talk Topic: Implement, Integrate, and Extend a Query Engine

🙋 Speaker: Ruihang Xia, GreptimeDB Maintainer, Apache DataFusion PMC Member, Arrow Committer, HoraeDB PPMC Member

🔍 Key Highlights:

  • Recalling the old days of building a query engine by hand
  • Extending Apache DataFusion: PromQL & Elasticsearch query string support
  • Performance extensions: Windowed Sort
  • Architectural extensions: Distributed queries
  • Query type extensions: Geo / Vector / JSON / and more

For a deeper dive into the topic, you can watch the full presentation on YouTube. The presentation material can be found here.

Open to All: This session is open to anyone interested in open-source technology, database systems, and related topics. We welcome database enthusiasts from around the world to join.

About Apache DataFusion

DataFusion is a high-performance query planning, optimization, and execution framework. DataFusion was created in 2017 and donated to the Apache Arrow project in 2019. DataFusion is written in Rust and takes advantage of Arrow’s in-memory data model for performance and compatibility with other projects.

The long-term goal of DataFusion is to become an embedded query engine that can be used with any analytics application, offering SQL compatibility, a pandas-style DataFrame API, the ability to create execution plans via API, and best-in-class query performance across all of these APIs.

About Greptime

Greptime offers industry-leading time series database products and solutions to empower IoT and Observability scenarios, enabling enterprises to uncover valuable insights from their data with less time, complexity, and cost.

GreptimeDB is an open-source, high-performance time-series database offering unified storage and analysis for metrics, logs, and events. Try it out instantly with GreptimeCloud, a fully-managed DBaaS solution—no deployment needed!

The Edge-Cloud Integrated Solution combines multimodal edge databases with cloud-based GreptimeDB to optimize IoT edge scenarios, cutting costs while boosting data performance.

Star us on GitHub or join GreptimeDB Community on Slack to get connected.
