Biweekly
May 10, 2023

Biweekly Report (Apr. 24 – May 7) – Support Querying External Data

A recap of the past two weeks' progress and changes in GreptimeDB.

Summary

Together with all our contributors worldwide, we are glad to see GreptimeDB making remarkable progress. Below are some highlights:

  • Support DELETE in distributed mode
  • Support querying external data
  • Refactor remote catalog manager
  • Support import/export of datasets in CSV and JSON formats

Contributor list: (in alphabetical order)

For the past two weeks, our community has been super active, with a total of 7 PRs from 3 contributors merged successfully and many more pending. Congrats to our most active contributors of the past two weeks:

👏 Let's welcome @DevilExileSu and @NiwakaDev as new contributors to our community, each with 3 PRs merged.

A big THANK YOU for the generous and brilliant contributions! It is people like you who are making GreptimeDB a great product. Let's build an even greater community together.

Highlights of Recent PRs

Support DELETE in distributed mode

We have been committed to supporting the DELETE SQL statement and initially shipped a minimal working version. DELETE is now supported not only in standalone mode but also in distributed mode, via both gRPC and SQL.
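As a minimal sketch (the `monitor` table, its `host` tag column, and `ts` time index are hypothetical, not taken from the PR), the same statement now behaves identically against a standalone instance and a distributed cluster:

```sql
-- Hypothetical schema: `host` is a tag column, `ts` the time index.
-- This DELETE now works in both standalone and distributed mode,
-- whether issued over SQL or built through the gRPC API.
DELETE FROM monitor WHERE host = 'host1' AND ts = 1655276557000;
```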

Support query external data

When processing time series data, it is common to combine it with additional information, so the ability to query external data sources directly is very useful. We now support seamless creation of external tables over data in various formats, including JSON, CSV, and Parquet. For example, suppose we have a CSV file at /var/data/city.csv on the local file system:

```csv
Rank , Name , State , 2023 Population , 2020 Census , Annual Change , Density (mi²)
1 , New York City , New York , 8,992,908 , 8,804,190 , 0.7% , 29,938
2 , Los Angeles , California , 3,930,586 , 3,898,747 , 0.27% , 8,382
3 , Chicago , Illinois , 2,761,625 , 2,746,388 , 0.18% , 12,146
......
```

Then we can create an external table city from it:

```sql
MySQL> CREATE EXTERNAL TABLE city WITH (location='/var/data/city.csv', format='csv');
```

And query it with SQL:

```sql
MySQL> select * from city;
```

(query result)

We can even join it with time series data:

```sql
select temperatures.value, city.population from temperatures
    left join city on city.name = temperatures.city;
```

Refactor remote catalog manager

Arrow DataFusion provides a set of traits (CatalogList / CatalogProvider / SchemaProvider) for query engines to retrieve table entities by the triplet catalog name / schema name / table name. However, all of these traits are synchronous, which means we cannot rely on any async operation to retrieve table entities from a remote catalog implementation, such as the metasrv in distributed mode.

In this PR, instead of fetching table entities from the underlying storage during planning, we resolve the tables referenced by a statement in advance and put them in a memory-based catalog list. As a result, we avoid the overhead of bridging sync traits with async implementations.

Support import/export of datasets in CSV and JSON formats

We now support importing and exporting datasets as CSV and JSON files; a sketch of the new statements follows the list below. Here are the main changes:

  • Add ParquetRecordBatchStreamAdapter (ParquetRecordBatchStream -> DataFusion RecordBatchStream)
  • Refactor the COPY FROM executor
  • Support COPY FROM CSV and JSON format files
  • Support copying a table to CSV and JSON format files
  • Add tests
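As a rough sketch of the new statements (the table name, file paths, and exact option spelling here are assumptions for illustration, not taken from the PR):

```sql
-- Export a table to a CSV file (table name and paths are hypothetical).
COPY city TO '/var/data/city_export.csv' WITH (format = 'csv');

-- Import a JSON file into an existing table.
COPY city FROM '/var/data/city_import.json' WITH (format = 'json');
```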

Those are the latest updates on GreptimeDB, and we are constantly making progress. We believe that the strength of our software shines through the strengths of each individual community member. Thanks for all your contributions.

Join our community

Get the latest updates and discuss with other users.