The Paradox of Self-Monitoring
When using GreptimeDB as the time-series storage backend, it becomes necessary to monitor GreptimeDB itself. This type of monitoring, which we refer to as meta-monitoring, presents an intriguing paradox: despite being a powerful time-series database, GreptimeDB still requires:
Prometheus for monitoring GreptimeDB’s Metrics.
Loki (or another log management system) for monitoring GreptimeDB’s Logs.
Jaeger (or another tracing service) for tracking GreptimeDB’s Tracing data.
In other words, even though GreptimeDB is a robust time-series database, you need to install 2-3 additional systems to monitor its own operation. While this paradox is common in today’s observability tech stacks, we believe a well-designed time-series database should be capable of self-monitoring — i.e., using the same time-series database to monitor itself.
Upon closer inspection, the core of this paradox is: is your time-series database capable of storing and computing multiple time-series types (Metrics, Logs, and Traces)? Once a time-series database achieves this, self-monitoring becomes relatively simple.
After a period of iteration and development, GreptimeDB has implemented a preliminary self-monitoring solution. Now, we can easily use GreptimeDB itself to store and query most of its own monitoring data.
Benefits of Self-Monitoring
Self-monitoring brings several key benefits:
Homogeneity: The biggest advantage of self-monitoring is the homogeneity of the tech stack. Unified technology stacks significantly reduce operational costs. Users no longer need to deploy different types of time-series databases for various monitoring purposes; everything becomes streamlined.
Storage and Fusion Querying of Multiple Time-Series Data Types: Users can query different data types (Metrics, Logs, and Traces) together within a single SQL statement, simplifying troubleshooting in GreptimeDB.
Resource Efficiency: With just one core database (typically the standalone version), users can monitor GreptimeDB without needing to deploy multiple additional databases, saving CPU and memory resources.
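To make the fusion-querying benefit concrete, here is a sketch of what such a query could look like. The table and column names (`cpu_metrics`, `app_logs`, etc.) are purely illustrative and not GreptimeDB's actual self-monitoring schema:

```sql
-- Hypothetical example: correlate error-log lines with CPU metrics
-- from the same time window, in a single SQL statement.
SELECT m.ts, m.cpu_usage, l.message
FROM cpu_metrics m
JOIN app_logs l
  ON l.ts BETWEEN m.ts - INTERVAL '5 seconds' AND m.ts
WHERE l.level = 'ERROR'
ORDER BY m.ts;
```

Because Metrics and Logs live in the same database, this kind of cross-type join needs no federation layer or second query engine.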
GreptimeDB’s self-monitoring solution is first implemented in the GreptimeDB Enterprise Dashboard. Additionally, the GreptimeDB Operator now offers self-monitoring in cluster mode. Users can easily activate self-monitoring and gather both Metrics and Logs by making a few simple configurations in their GreptimeDB Kubernetes CRD. Paired with the provided Grafana dashboard, this gives users a ready-to-use, comprehensive observability solution.
Interested readers can check out the "Quick Start" guide for our cluster version here.
In the GreptimeDB Enterprise Dashboard, we not only allow the combined viewing of Metrics and Logs using Grafana but also explore advanced query capabilities to provide higher-level automated diagnostics by merging these two data types.
How It Works
Given that GreptimeDB’s underlying design already integrates Metrics and Logs, we essentially have a time-series database capable of storing and querying both Metrics and Logs. This solves the core issue of data storage and computation.
Next, we need to address two critical questions:
How do we collect the data?
How do we integrate this with the GreptimeDB Operator?
The first question is easy to answer. The community already provides a high-performance agent that aligns with our vision: Vector.
Interested readers can explore our past articles on Vector.
Once Vector is installed, it can collect various types of monitoring data simultaneously. Users no longer need to deploy different types of collectors for different monitoring data. In Kubernetes, Vector can be deployed in the optimal DaemonSet mode, or more flexibly in Sidecar mode.
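As a rough illustration of what this unified collection looks like, the sketch below shows a Vector configuration that scrapes GreptimeDB's Prometheus metrics endpoint and tails its log files, writing both to a GreptimeDB instance. The endpoints, file paths, and sink options are assumptions for illustration and must be adapted to a real deployment:

```yaml
# Illustrative Vector configuration (endpoints and paths are assumed).
sources:
  greptimedb_metrics:
    type: prometheus_scrape
    endpoints:
      - http://localhost:4000/metrics   # assumed metrics endpoint

  greptimedb_log_files:
    type: file
    include:
      - /data/greptimedb/logs/*.log     # assumed log path

sinks:
  metrics_out:
    type: greptimedb            # Vector's GreptimeDB metrics sink
    inputs: [greptimedb_metrics]
    endpoint: "monitor-standalone:4001"
    dbname: public

  logs_out:
    type: greptimedb_logs       # Vector's GreptimeDB logs sink
    inputs: [greptimedb_log_files]
    endpoint: "http://monitor-standalone:4000"
    dbname: public
    table: greptime_logs
```

One agent, one configuration file, both data types.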
The second question addresses deployment. Our goal is to offer users the most seamless experience, avoiding the complexity of maintaining multiple charts in Helm. Thus, we've added a `monitoring` field to the `GreptimeDBCluster` CRD. Users can easily enable self-monitoring by adding the following configuration to their `GreptimeDBCluster` CR:
```yaml
apiVersion: greptime.io/v1alpha1
kind: GreptimeDBCluster
...
spec:
  ...
  monitoring:
    enabled: true
  ...
```
Once this configuration is set, the GreptimeDB Operator will:
Deploy a Standalone GreptimeDB to monitor the GreptimeDB cluster: Although GreptimeDB could write its monitoring data into itself, we want to avoid impacting the real monitoring data with meta-monitoring queries and writes. To ensure resource isolation, we choose not to use this mode. For most GreptimeDB clusters, deploying a Standalone instance for monitoring is sufficient and simplifies deployment and maintenance.
Deploy a Vector Sidecar to collect both Metrics and Logs: The Operator deploys a low-resource Vector Sidecar inside each GreptimeDB Pod to collect both Metrics and Logs (including slow queries) and write them into the Standalone instance.
While the Sidecar mode uses Pod resources and may introduce potential stability concerns, its flexibility makes it suitable for small to medium-sized clusters. For more performance-sensitive environments, the DaemonSet mode for Vector can be deployed.
To provide a true "out-of-the-box" experience, our official Helm chart supports Grafana deployment. When users enable `grafana.enabled=true`, a Grafana instance is deployed, with data sources and dashboards configured automatically.
Interested readers can follow our Quick Start guide to experience this process firsthand.
This allows users to deploy both the GreptimeDB cluster and its monitoring with a single command, ensuring the systems are isolated and do not interfere with each other.
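Putting the two options together, a minimal Helm values file could look like the following sketch (the key names follow the options mentioned above; the chart name and any other details are assumptions to be checked against the official chart documentation):

```yaml
# Hypothetical values.yaml for the GreptimeDB cluster Helm chart.
monitoring:
  enabled: true   # deploy the standalone monitor + Vector sidecars
grafana:
  enabled: true   # deploy Grafana with data sources and dashboards preconfigured
```

A single `helm install` with these values would then stand up the cluster, its monitoring stack, and the dashboards in one step.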
Going Further
Curious readers may ask: Who monitors the Standalone GreptimeDB instance? A simple solution could be to apply self-monitoring here as well, writing the monitoring data into itself and utilizing its computational power for querying.
We can afford this in meta-monitoring because:
The resource isolation requirements for meta-monitoring are lower, and we can tolerate the performance overhead of querying self-monitoring data. This overhead does not significantly affect the monitored workload.
The monitoring data for meta-monitoring is relatively small, and the queries are straightforward. Self-monitoring is more than sufficient in this case.
However, this may not be the most optimal approach. If the meta-monitoring system itself fails, this method would also fail. An alternative approach is to use a monitoring service with independent failure domains, such as a public cloud monitoring service, to monitor the meta-monitoring service.
We are still working on this capability and look forward to delivering it in the future.
Future Plans
GreptimeDB’s monitoring bootstrapping capability is a small but promising experiment in integrating time-series data storage and computation. Moving forward, we plan to:
Integrate Vector's operational workflow into the GreptimeDB Operator, enabling users to manage both GreptimeDB and Vector with ease.
Further explore the fusion of time-series data storage and computation. For example, much like the Prometheus Operator, we could define models that allow users to:
Collect data (e.g., using Pod or Service selectors).
Preprocess data.
Store the collected data.
And more.
In this unified model, users would only need to focus on producing and consuming monitoring data, without worrying about the underlying storage and computation details.
About Greptime
Greptime offers industry-leading time series database products and solutions to empower IoT and Observability scenarios, enabling enterprises to uncover valuable insights from their data with less time, complexity, and cost.
GreptimeDB is an open-source, high-performance time-series database offering unified storage and analysis for metrics, logs, and events. Try it out instantly with GreptimeCloud, a fully-managed DBaaS solution—no deployment needed!
The Edge-Cloud Integrated Solution combines multimodal edge databases with cloud-based GreptimeDB to optimize IoT edge scenarios, cutting costs while boosting data performance.
Star us on GitHub or join GreptimeDB Community on Slack to get connected.