What is GreptimeDB
GreptimeDB is a powerful time-series database written in Rust, making it safe, fast, and efficient for writing, reading, and analyzing time-series data in real time.
To help new users quickly master GreptimeDB and get value from their data in real time, we have prepared a series of articles on how to use it. These articles provide detailed installation and usage steps, organized by use case, so you can apply the product effectively in your own environment.
Together, let's explore the infinite possibilities of GreptimeDB!
How to use GreptimeDB to analyze external data
The true value of data is only realized when it is analyzed. GreptimeDB enhances data analysis by supporting both SQL and Python scripts.
Let's say you have a data file in CSV format and want to leverage the analytical power of GreptimeDB. GreptimeDB provides two ways to quickly query and analyze external data, and we'll elaborate on each of them in the following sections.
To simplify our explanation, let's imagine a hypothetical scenario based on a fragment of a publicly available Kaggle dataset:
You are collaborating with the government to create a digital and intelligent city that improves the efficiency of public services. One of the challenges is traffic congestion. As a data scientist, your goal is to better manage urban traffic and provide advice on future infrastructure planning.
The government's objective is to establish a robust transportation system that can adeptly handle peak traffic periods. To achieve this, they've provided you with a comprehensive set of historical data outlining traffic patterns at various city intersections. Your assignment is to leverage this data to develop a model that enables effective analysis and prediction of traffic trends.
Prepare
Environment
If you do not have GreptimeDB installed, the following content will guide you through quickly installing and starting it. If you have already installed and started GreptimeDB locally, you can skip ahead to the data preparation section.
- Install GreptimeDB (using Docker as an example)
Open a terminal on your computer and enter the following command (the same applies to the commands below).
docker pull greptime/greptimedb:latest
For more installation methods, see the download page or the Getting Started docs.
- Start GreptimeDB
docker run -p 4000-4004:4000-4004 \
-p 4242:4242 -v "$(pwd)/greptimedb:/tmp/greptimedb" \
--name greptime --rm greptime/greptimedb standalone start \
--http-addr 0.0.0.0:4000 \
--rpc-addr 0.0.0.0:4001 \
--mysql-addr 0.0.0.0:4002 \
--postgres-addr 0.0.0.0:4003 \
--prom-addr 0.0.0.0:4004 \
--opentsdb-addr 0.0.0.0:4242
- Connect to GreptimeDB
Use the following command to connect to GreptimeDB with the MySQL client (or another compatible client) and make sure the service is running properly.
mysql -h 127.0.0.1 -P 4002
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 8
Server version: 5.1.10-alpha-msql-proxy Greptime
Copyright (c) 2000, 2021, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> show tables;
+---------+
| Tables |
+---------+
| numbers |
| scripts |
+---------+
2 rows in set (0.01 sec)
Data
GreptimeDB supports importing data in CSV, JSON, and Parquet formats. We downloaded a publicly available IoT dataset from Kaggle that primarily describes the traffic conditions at four junctions in a city.
The following shows a part of the data.
DateTime,Junction,Vehicles,ID
2015-11-01 00:00:00,1,15,20151101001
2015-11-01 01:00:00,1,13,20151101011
2015-11-01 02:00:00,1,10,20151101021
2015-11-01 03:00:00,1,7,20151101031
2015-11-01 04:00:00,1,9,20151101041
2015-11-01 05:00:00,1,6,20151101051
2015-11-01 06:00:00,1,9,20151101061
2015-11-01 07:00:00,1,8,20151101071
2015-11-01 08:00:00,1,11,20151101081
While preparing your data, it's crucial to ensure that the field names in the header do not collide with any of GreptimeDB's reserved keywords. For instance, DateTime in this dataset is a keyword reserved by the database, so such names must be changed before importing the data; as a suggestion, DateTime could be renamed to TS.
In this tutorial, we started GreptimeDB through Docker and mapped files using -v "$(pwd)/greptimedb:/tmp/greptimedb". As a result, we need to copy the prepared data file into that directory.
cp /<path_of_data_folder>/train_ML_IOT.csv $(pwd)/greptimedb/
Query and analyze external data
Query data through external table
GreptimeDB can query and analyze external data files directly, without an import step, which keeps the process simple for users.
- Create external table
mysql> CREATE EXTERNAL TABLE train_ML_IOT_sample WITH (LOCATION='/tmp/greptimedb/train_ML_IOT.csv',format='csv');
Query OK, 0 rows affected (0.08 sec)
- Query and analyze the data through SQL/Python
mysql> SELECT * FROM "train_ML_IOT_sample" WHERE "Junction" = 1 LIMIT 10;
+---------------------+----------+----------+-------------+
| TS | Junction | Vehicles | ID |
+---------------------+----------+----------+-------------+
| 2015-11-01 00:00:00 | 1 | 15 | 20151101001 |
| 2015-11-01 01:00:00 | 1 | 13 | 20151101011 |
| 2015-11-01 02:00:00 | 1 | 10 | 20151101021 |
| 2015-11-01 03:00:00 | 1 | 7 | 20151101031 |
| 2015-11-01 04:00:00 | 1 | 9 | 20151101041 |
| 2015-11-01 05:00:00 | 1 | 6 | 20151101051 |
| 2015-11-01 06:00:00 | 1 | 9 | 20151101061 |
| 2015-11-01 07:00:00 | 1 | 8 | 20151101071 |
| 2015-11-01 08:00:00 | 1 | 11 | 20151101081 |
| 2015-11-01 09:00:00 | 1 | 12 | 20151101091 |
+---------------------+----------+----------+-------------+
10 rows in set (0.02 sec)
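Beyond simple filtering, aggregate queries also work directly on the external table. Here is a minimal sketch, assuming standard SQL aggregates (the column names follow the renamed header above):
-- Average traffic per junction, busiest first,
-- computed directly on the CSV-backed external table.
SELECT "Junction", avg("Vehicles") AS avg_vehicles
FROM train_ML_IOT_sample
GROUP BY "Junction"
ORDER BY avg_vehicles DESC;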
With this approach, the data isn't actually imported into GreptimeDB, yet its analytical capabilities can still be leveraged. The following section demonstrates the method we recommend for larger datasets: importing the data first.
Query data through importing external data
When dealing with substantial data volumes, particularly files exceeding 100 MB, querying external tables directly can perform poorly. In such cases, it is advisable to import the data into GreptimeDB for analysis.
- Create table
We need to create a table in advance for the data to be imported, including its fields, types, and the indexes used for queries. Suppose we might need to query the traffic situation at different junctions over time; then we should create indexes on the TS and Junction fields. In a time-series database, the TS field typically serves as the time index, while Junction is a natural primary key.
Tip: during table creation, you can quickly obtain the schema by creating an external table first, as follows:
mysql> CREATE EXTERNAL TABLE train_ML_IOT_sample WITH (LOCATION='/tmp/greptimedb/train_ML_IOT.csv',format='csv');
Query OK, 0 rows affected (0.08 sec)
mysql> SHOW CREATE TABLE train_ML_IOT_sample;
+---------------------+------------------------------------------------------------+
| Table | Create Table |
+---------------------+------------------------------------------------------------+
| train_ML_IOT_sample | CREATE EXTERNAL TABLE IF NOT EXISTS train_ML_IOT_sample (
TS TIMESTAMP(0) NULL,
Junction BIGINT NULL,
Vehicles BIGINT NULL,
ID BIGINT NULL,
)
ENGINE=file
WITH(
format = 'csv',
location = '/tmp/greptimedb/train_ML_IOT.csv'
) |
+---------------------+------------------------------------------------------------+
Based on the output above, we can see the field types recognized by GreptimeDB. We can then copy the field names and types and execute the following statement to create an internal table:
CREATE TABLE IF NOT EXISTS train_ML_IOT (
TS TIMESTAMP(0) NULL,
Junction BIGINT NULL,
Vehicles BIGINT NULL,
ID BIGINT NULL,
TIME INDEX (TS),
PRIMARY KEY (Junction)
);
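Before importing, it can be worth confirming that the table was created as intended. As a quick check (a sketch, assuming the DESC TABLE statement described in GreptimeDB's docs):
-- Inspect column types, the time index, and the primary key of the new table.
DESC TABLE train_ML_IOT;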
- Import data
GreptimeDB supports the COPY statement for importing and exporting data. For more details, please refer to https://docs.greptime.com/reference/sql/copy.
mysql> COPY train_ML_IOT FROM '/tmp/greptimedb/train_ML_IOT.csv' WITH (FORMAT='csv');
Query OK, 48120 rows affected (0.11 sec)
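COPY works in both directions. As an illustration (the output path below is hypothetical), the imported table can be exported to Parquet for use in other tools:
-- Export the table to a Parquet file under the mapped directory.
COPY train_ML_IOT TO '/tmp/greptimedb/train_ML_IOT_export.parquet' WITH (FORMAT = 'parquet');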
- Query and analyze the data through SQL/Python
mysql> SELECT * FROM "train_ML_IOT" WHERE "Junction" = 1 LIMIT 10;
+---------------------+----------+----------+-------------+
| TS | Junction | Vehicles | ID |
+---------------------+----------+----------+-------------+
| 2017-06-20 08:00:00 | 1 | 60 | 20170620081 |
| 2017-06-20 09:00:00 | 1 | 76 | 20170620091 |
| 2017-06-20 10:00:00 | 1 | 92 | 20170620101 |
| 2017-06-20 11:00:00 | 1 | 102 | 20170620111 |
| 2017-06-20 12:00:00 | 1 | 113 | 20170620121 |
| 2017-06-20 13:00:00 | 1 | 100 | 20170620131 |
| 2017-06-20 14:00:00 | 1 | 107 | 20170620141 |
| 2017-06-20 15:00:00 | 1 | 110 | 20170620151 |
| 2017-06-20 16:00:00 | 1 | 101 | 20170620161 |
| 2017-06-20 17:00:00 | 1 | 108 | 20170620171 |
+---------------------+----------+----------+-------------+
10 rows in set (0.03 sec)
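With the data imported, the scenario's core question, peak traffic periods, can be explored directly in SQL. Here is a sketch, assuming the date_part function is available (GreptimeDB's query engine builds on Apache DataFusion, which provides it):
-- For each junction, rank the hours of the day by average traffic volume.
SELECT "Junction",
       date_part('hour', "TS") AS hour_of_day,
       avg("Vehicles") AS avg_vehicles
FROM train_ML_IOT
GROUP BY "Junction", date_part('hour', "TS")
ORDER BY avg_vehicles DESC
LIMIT 5;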
Conclusion
The two methods above allow for swift querying and analysis of external data with GreptimeDB. By harnessing the power of machine learning libraries, we can effectively carry out predictive analysis. Since GreptimeDB supports Python, users can also employ machine learning libraries from the Python ecosystem for data modeling and analysis. Python users can navigate to this document to learn more about data analysis with Python.
GreptimeDB is iterating quickly, and we are committed to providing more informative tutorials and best-practice guides. Should you have any questions or encounter any issues, join our Slack community and reach out to us for assistance.