In the wave of digital transformation, data flows in like a tide, and we are constantly surrounded by vast amounts of information. Imagine being in a massive library containing billions of books, documents, images, and videos. If you wanted to quickly find materials related to a specific image or locate documents that are similar to a particular text description, traditional search methods would often fall short.
Traditional search is mostly based on keyword matching, which may be effective for simple, clear queries but struggles with more complex semantic understanding and similarity judgment. For example, when searching for texts or logs with similar meanings but different phrasing, or images with similar styles, traditional search methods often fail to deliver precise results. This is where vector search provides a fundamental solution.
What is Vector Search: A Smarter Way to Retrieve Information
Unlike traditional keyword-based matching, vector search transforms text, images, audio, and other types of data into numerical vectors. These vectors act as "digital fingerprints" containing key features of the data. By calculating the similarity between these vectors, it can quickly identify other data that is similar to the target data. For instance, when searching for an image, vector search first converts the image into a vector and then compares it against a vast database of image vectors, identifying the images most similar to the original.
Vector search opens up new possibilities for efficient and accurate information retrieval. GreptimeDB v0.10 integrates the VSAG vector search library, an open-source project from Ant Group, bringing powerful vector search capabilities to the table.
Now let’s dive deeper into how it stands out in the era of data explosion.
GreptimeDB has also open-sourced the Rust binding library for VSAG. Those interested can explore the code on GitHub: https://github.com/GreptimeTeam/VSAG-sys.
The following illustration uses Python 3.12, inspired by the article Vector Databases: A Beginner’s Guide!
Download Python Dependencies
First, you need to install some necessary dependency libraries for Python. You are recommended to use venv
to configure the virtual environment:
pip3 install wget --quiet
pip3 install openai==1.3.3 --quiet
pip3 install sentence-transformers --quiet
pip3 install pandas --quiet
pip3 install sqlalchemy --quiet
pip3 install mysql-connector-python --quiet
The sentence-transformers
library is used for vectorizing text, while sqlalchemy
and mysql-connector-python
are used to connect to GreptimeDB and perform read and write operations. All the rest are utility libraries.
Download Model and Datasets
We use the AG News dataset, which contains a total of 2,000 records.
First, import the necessary dependencies:
import json
import os
import pandas as pd
import wget
from sentence_transformers import SentenceTransformer
import sqlalchemy as sa
from sqlalchemy import create_engine
Download model:
model = SentenceTransformer('flax-sentence-embeddings/all_datasets_v3_mpnet-base')
Download and parse datasets:
cvs_file_path = 'https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/data/AG_news_samples.csv'
file_path = 'AG_news_samples.csv'
if not os.path.exists(file_path):
wget.download(cvs_file_path, file_path)
print('File downloaded successfully.')
else:
print('File already exists in the local file system.')
df = pd.read_csv('AG_news_samples.csv')
data = df.to_dict(orient='records')
Create Tables
Assuming you have correctly installed the standalone version of GreptimeDB according to the installation guide, connect to the database using the MySQL client or open the Dashboard to create a table:
You can also use the GreptimeCloud managed service to test this example, register here for free.
CREATE TABLE IF NOT EXISTS news_articles (
title STRING FULLTEXT,
description STRING FULLTEXT,
genre STRING,
embedding VECTOR(768),
ts timestamp default current_timestamp(),
PRIMARY KEY(title),
TIME INDEX(ts)
);
title
,description
, andgenre
correspond to the article’s title, description, and type information, all of which are of theSTRING
type, with full-text indexing applied to both the title and description.embedding
is set to a 768-dimensionalVECTOR
type.- GreptimeDB's table model enforces a timestamp column. Since our test dataset does not include the article creation time, we set the
ts
column to default ascurrent_timestamp
.
Data Write-in
Now let's embed the description
of the datasets:
descriptions = [row['description'] for row in data]
all_embeddings = model.encode(descriptions)
When using SQL to insert vector data types into GreptimeDB, the vectors need to be converted into strings for processing. Therefore, we write a function to serialize the vector array into a string and process the dataset:
def embedding_s(embedding):
return f"[{','.join(map(str, embedding))}]"
for row, embedding in zip(data, all_embeddings):
row['embedding'] = embedding_s(embedding)
Note: if you choose to write data with a gRPC client, then you don't have to do the conversion, it can write binary directly.
Connect to the database:
connection_string = "mysql+mysqlconnector://root:@0.0.0.0:4002/public"
conn = create_engine(connection_string, echo=True).connect()
Here, we are connecting to GreptimeDB running locally on port 0.0.0.0:4002
, with the database set to public
, the username as root
, and no password. If you are using GreptimeCloud for testing, please replace them with the correct parameters.
Data ingestion:
statement = sa.text('''
INSERT INTO news_articles (
title,
description,
genre,
embedding
)
VALUES (
:title,
:description,
:label,
:embedding
)
''')
for i in range(0, len(data), 100):
conn.execute(statement, data[i:i + 100])
We write the data to GreptimeDB in batches of 100. If everything goes smoothly, we can try executing the following query in the MySQL client or Dashboard:
SELECT title, description, genre, vec_to_string(embedding)
FROM news_articles LIMIT 1\G;
The result should be as below:
We use vec_to_string
to serialize the vector type into a string and print it.
Vector Search
Next, we try performing a vector search to find articles with similar semantics:
search_query = 'China Sports'
search_embedding = embedding_s(model.encode(search_query))
query_statement = sa.text('''
SELECT
title,
description,
genre,
vec_dot_product(embedding, :embedding) AS score
FROM news_articles
ORDER BY score DESC
LIMIT 10
''')
results = pd.DataFrame(conn.execute(query_statement, dict(embedding=search_embedding)))
print(results)
The results should be as follow:
title description genre score
0 Yao Ming, Rockets in Shanghai for NBA #39;s fi... SHANGHAI, China The Houston Rockets have arriv... Sports 0.487205
1 Day 6 Roundup: China back on winning track After two days of gloom, China was back on the... Sports 0.438824
2 NBA brings its game to Beijing BEIJING The NBA has reached booming, basketbal... Sports 0.434785
3 China supreme heading for Beijing ATHENS: China, the dominant force in world div... Sports 0.414838
4 IBM, Unisys work to rejuvenate mainframes Big Blue adds features, beefs up training effo... Sci/Tech 0.403031
5 China set for F1 Grand Prix A bird #39;s eye view of the circuit at Shangh... Sports 0.401599
6 China Computer Maker Acquires IBM PC Biz (AP) AP - China's biggest computer maker, Lenovo Gr... Sci/Tech 0.382631
7 Wagers on oil price prove a slippery slope for... State-owned, running a monopoly on imports of ... Business 0.374331
8 Microsoft Order Cancelled by Beijing The Chinese city of Beijing has cancelled an o... Sci/Tech 0.359765
9 Trading Losses at Chinese Firm Coming to Light The disclosure this week that a Singapore-list... Business 0.352431
Yao Ming is the headline!
Here, we first perform embedding on the search keyword, then use the vec_dot_product
function to calculate the dot product of two vectors as the similarity score. Sort the results and limit the output to 10 entries. For more vector functions, please refer to the documentation: https://docs.greptime.com/nightly/reference/sql/functions/vector/
Now let's compare the result with the one based on Keyword-match search:
search_query = 'China Sports'
query_statement = sa.text('''
SELECT
title,
description,
genre
FROM news_articles
WHERE matches(description, :search_query)
LIMIT 10
''')
results = pd.DataFrame(conn.execute(query_statement, dict(search_query=search_query)))
print(results)
We can see the results have changed:
title description genre
0 Beijing signs pact for Asean trade VIENTIANE, Laos China moved yet another step c... World
1 China to strengthen coal mine safety China will take tough measures this winter to ... World
2 Grace Park, Koch share lead in Korea Jeju Island, South Korea (Sports Network) - Gr... Sports
3 T. rex #39;s smaller ancestor was covered with... Fossil remains of the oldest and smallest know... Sci/Tech
4 Boston Red Sox Team Report - September 6 (Sports Network) - Two of the top teams in the... Sports
5 China's Economic Boom Still Roaring (AP) AP - China's economic boom is still roaring de... Business
6 Guo tucks away gold for China China's Guo Jingjing easily won the women's 3-... Sports
7 Microsoft Order Cancelled by Beijing The Chinese city of Beijing has cancelled an o... Sci/Tech
8 Powell wins backing for new NKorea pressure, b... AFP - US Secretary of State Colin Powell wrapp... World
9 U.S. to Urge China to Push for More N.Korea Talks BEIJING (Reuters) - Secretary of State Colin ... World
This result is based on text matching, with China
or Sports
appearing in the description.
Harnessing the Power of Vector Search for Data Intelligence
All the source code for this article can be found in this repository.
GreptimeDB’s vector search functionality helps users find accurate answers in vast datasets, and this is just the beginning. In the process of digital transformation, vector databases will become an essential tool for driving data intelligence. Whether it’s AIOps, observability, or AI applications, vector search-based technologies will bring unprecedented efficiency and accuracy to the industry.
If you’re interested in vector search and GreptimeDB, we encourage you to dive deeper and join our community to help create a smarter future.
About Greptime
Greptime offers industry-leading time series database products and solutions to empower IoT and Observability scenarios, enabling enterprises to uncover valuable insights from their data with less time, complexity, and cost.
GreptimeDB is an open-source, high-performance time-series database offering unified storage and analysis for metrics, logs, and events. Try it out instantly with GreptimeCloud, a fully-managed DBaaS solution—no deployment needed!
The Edge-Cloud Integrated Solution combines multimodal edge databases with cloud-based GreptimeDB to optimize IoT edge scenarios, cutting costs while boosting data performance.
Star us on GitHub or join GreptimeDB Community on Slack to get connected.