Join us for a virtual meetup on Zoom at 8 PM, July 31 (PDT) about using One Time Series Database for Both Metrics and Logs 👉🏻 Register Now

Skip to content
On this page
Engineering
January 22, 2025

Vector Search for Intelligent Similarity Data Querying in GreptimeDB

GreptimeDB v0.10 introduces a vector retrieval feature, this article demonstrates how to efficiently and accurately extract semantically similar information from vast amounts of data. It explains how to embed text data, insert vector data into GreptimeDB, and perform vector similarity searches.

In the wave of digital transformation, data flows in like a tide, and we are constantly surrounded by vast amounts of information. Imagine being in a massive library containing billions of books, documents, images, and videos. If you wanted to quickly find materials related to a specific image or locate documents that are similar to a particular text description, traditional search methods would often fall short.

Traditional search is mostly based on keyword matching, which may be effective for simple, clear queries but struggles with more complex semantic understanding and similarity judgment. For example, when searching for texts or logs with similar meanings but different phrasing, or images with similar styles, traditional search methods often fail to deliver precise results. This is where vector search provides a fundamental solution.

What is Vector Search: A Smarter Way to Retrieve Information

Unlike traditional keyword-based matching, vector search transforms text, images, audio, and other types of data into numerical vectors. These vectors act as "digital fingerprints" containing key features of the data. By calculating the similarity between these vectors, it can quickly identify other data that is similar to the target data. For instance, when searching for an image, vector search first converts the image into a vector and then compares it against a vast database of image vectors, identifying the images most similar to the original.

Vector search opens up new possibilities for efficient and accurate information retrieval. GreptimeDB v0.10 integrates the VSAG vector search library, an open-source project from Ant Group, bringing powerful vector search capabilities to the table.

Now let’s dive deeper into how it stands out in the era of data explosion.

GreptimeDB has also open-sourced the Rust binding library for VSAG. Those interested can explore the code on GitHub: https://github.com/GreptimeTeam/VSAG-sys.

The following illustration uses Python 3.12, inspired by the article Vector Databases: A Beginner’s Guide!

Download Python Dependencies

First, you need to install some necessary dependency libraries for Python. You are recommended to use venv to configure the virtual environment:

sql
pip3 install wget --quiet
pip3 install openai==1.3.3 --quiet
pip3 install sentence-transformers --quiet
pip3 install pandas --quiet
pip3 install sqlalchemy --quiet
pip3 install mysql-connector-python --quiet

The sentence-transformers library is used for vectorizing text, while sqlalchemy and mysql-connector-python are used to connect to GreptimeDB and perform read and write operations. All the rest are utility libraries.

Download Model and Datasets

We use the AG News dataset, which contains a total of 2,000 records.

First, import the necessary dependencies:

python
import json
import os
import pandas as pd
import wget
from sentence_transformers import SentenceTransformer
import sqlalchemy as sa
from sqlalchemy import create_engine

Download model:

python
model = SentenceTransformer('flax-sentence-embeddings/all_datasets_v3_mpnet-base')

Download and parse datasets:

python
cvs_file_path = 'https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/data/AG_news_samples.csv'
file_path = 'AG_news_samples.csv'

if not os.path.exists(file_path):
    wget.download(cvs_file_path, file_path)
    print('File downloaded successfully.')
else:
    print('File already exists in the local file system.')

df = pd.read_csv('AG_news_samples.csv')
data = df.to_dict(orient='records')

Create Tables

Assuming you have correctly installed the standalone version of GreptimeDB according to the installation guide, connect to the database using the MySQL client or open the Dashboard to create a table:

You can also use the GreptimeCloud managed service to test this example, register here for free.

sql
CREATE TABLE IF NOT EXISTS news_articles (
    title STRING FULLTEXT,
    description STRING FULLTEXT,
    genre STRING,
    embedding VECTOR(768),
    ts timestamp default current_timestamp(),
    PRIMARY KEY(title),
    TIME INDEX(ts)
);
  • title, description, and genre correspond to the article’s title, description, and type information, all of which are of the STRING type, with full-text indexing applied to both the title and description.
  • embedding is set to a 768-dimensional VECTOR type.
  • GreptimeDB's table model enforces a timestamp column. Since our test dataset does not include the article creation time, we set the ts column to default as current_timestamp.

Data Write-in

Now let's embed the description of the datasets:

python
descriptions = [row['description'] for row in data]
all_embeddings = model.encode(descriptions)

When using SQL to insert vector data types into GreptimeDB, the vectors need to be converted into strings for processing. Therefore, we write a function to serialize the vector array into a string and process the dataset:

python
def embedding_s(embedding):
    return f"[{','.join(map(str, embedding))}]"
    
for row, embedding in zip(data, all_embeddings):
    row['embedding'] = embedding_s(embedding)

Note: if you choose to write data with a gRPC client, then you don't have to do the conversion, it can write binary directly.

Connect to the database:

python
connection_string = "mysql+mysqlconnector://root:@0.0.0.0:4002/public"
conn = create_engine(connection_string, echo=True).connect()

Here, we are connecting to GreptimeDB running locally on port 0.0.0.0:4002, with the database set to public, the username as root, and no password. If you are using GreptimeCloud for testing, please replace them with the correct parameters.

Data ingestion:

python
statement = sa.text('''
    INSERT INTO news_articles (
        title,
        description,
        genre,
        embedding
    )
    VALUES (
        :title,
        :description,
        :label,
        :embedding
    )
''')

for i in range(0, len(data), 100):
    conn.execute(statement, data[i:i + 100])

We write the data to GreptimeDB in batches of 100. If everything goes smoothly, we can try executing the following query in the MySQL client or Dashboard:

sql
SELECT title, description, genre, vec_to_string(embedding) 
   FROM news_articles LIMIT 1\G;

The result should be as below:

GreptimeDB Data Ingestion Result
Data Ingestion Result

We use vec_to_string to serialize the vector type into a string and print it.

Next, we try performing a vector search to find articles with similar semantics:

python
search_query = 'China Sports'
search_embedding = embedding_s(model.encode(search_query))

query_statement = sa.text('''
    SELECT
        title,
        description,
        genre,
        vec_dot_product(embedding, :embedding) AS score
    FROM news_articles
    ORDER BY score DESC
    LIMIT 10
''')

results = pd.DataFrame(conn.execute(query_statement, dict(embedding=search_embedding)))
print(results)

The results should be as follow:

bash
                                               title                                        description     genre     score
0  Yao Ming, Rockets in Shanghai for NBA #39;s fi...  SHANGHAI, China The Houston Rockets have arriv...    Sports  0.487205
1         Day 6 Roundup: China back on winning track  After two days of gloom, China was back on the...    Sports  0.438824
2                     NBA brings its game to Beijing  BEIJING The NBA has reached booming, basketbal...    Sports  0.434785
3                  China supreme heading for Beijing  ATHENS: China, the dominant force in world div...    Sports  0.414838
4          IBM, Unisys work to rejuvenate mainframes  Big Blue adds features, beefs up training effo...  Sci/Tech  0.403031
5                        China set for F1 Grand Prix  A bird #39;s eye view of the circuit at Shangh...    Sports  0.401599
6      China Computer Maker Acquires IBM PC Biz (AP)  AP - China's biggest computer maker, Lenovo Gr...  Sci/Tech  0.382631
7  Wagers on oil price prove a slippery slope for...  State-owned, running a monopoly on imports of ...  Business  0.374331
8               Microsoft Order Cancelled by Beijing  The Chinese city of Beijing has cancelled an o...  Sci/Tech  0.359765
9     Trading Losses at Chinese Firm Coming to Light  The disclosure this week that a Singapore-list...  Business  0.352431

Yao Ming is the headline!

Here, we first perform embedding on the search keyword, then use the vec_dot_product function to calculate the dot product of two vectors as the similarity score. Sort the results and limit the output to 10 entries. For more vector functions, please refer to the documentation: https://docs.greptime.com/nightly/reference/sql/functions/vector/

Now let's compare the result with the one based on Keyword-match search:

python
search_query = 'China Sports'
query_statement = sa.text('''
    SELECT
        title,
        description,
        genre
    FROM news_articles
    WHERE matches(description, :search_query)
    LIMIT 10
''')

results = pd.DataFrame(conn.execute(query_statement, dict(search_query=search_query)))
print(results)

We can see the results have changed:

java
                                               title                                        description     genre
0                 Beijing signs pact for Asean trade  VIENTIANE, Laos China moved yet another step c...     World
1               China to strengthen coal mine safety  China will take tough measures this winter to ...     World
2               Grace Park, Koch share lead in Korea  Jeju Island, South Korea (Sports Network) - Gr...    Sports
3  T. rex #39;s smaller ancestor was covered with...  Fossil remains of the oldest and smallest know...  Sci/Tech
4           Boston Red Sox Team Report - September 6  (Sports Network) - Two of the top teams in the...    Sports
5           China's Economic Boom Still Roaring (AP)  AP - China's economic boom is still roaring de...  Business
6                      Guo tucks away gold for China  China's Guo Jingjing easily won the women's 3-...    Sports
7               Microsoft Order Cancelled by Beijing  The Chinese city of Beijing has cancelled an o...  Sci/Tech
8  Powell wins backing for new NKorea pressure, b...  AFP - US Secretary of State Colin Powell wrapp...     World
9  U.S. to Urge China to Push for More N.Korea Talks   BEIJING (Reuters) - Secretary of State Colin ...     World

This result is based on text matching, with China or Sports appearing in the description.

Harnessing the Power of Vector Search for Data Intelligence

All the source code for this article can be found in this repository.

GreptimeDB’s vector search functionality helps users find accurate answers in vast datasets, and this is just the beginning. In the process of digital transformation, vector databases will become an essential tool for driving data intelligence. Whether it’s AIOps, observability, or AI applications, vector search-based technologies will bring unprecedented efficiency and accuracy to the industry.

If you’re interested in vector search and GreptimeDB, we encourage you to dive deeper and join our community to help create a smarter future.


About Greptime

Greptime offers industry-leading time series database products and solutions to empower IoT and Observability scenarios, enabling enterprises to uncover valuable insights from their data with less time, complexity, and cost.

GreptimeDB is an open-source, high-performance time-series database offering unified storage and analysis for metrics, logs, and events. Try it out instantly with GreptimeCloud, a fully-managed DBaaS solution—no deployment needed!

The Edge-Cloud Integrated Solution combines multimodal edge databases with cloud-based GreptimeDB to optimize IoT edge scenarios, cutting costs while boosting data performance.

Star us on GitHub or join GreptimeDB Community on Slack to get connected.

Join our community

Get the latest updates and discuss with other users.