When PromQL Meets SQL: Hybrid Queries for Kubernetes Monitoring

Learn how to combine PromQL time-series aggregation with SQL analytical capabilities using TQL and CTEs in GreptimeDB for advanced Kubernetes monitoring analysis.

Kubernetes monitoring is table stakes. The Prometheus + Grafana stack has been around for years, and plenty of engineers can write PromQL in their sleep. But when you try to correlate CPU usage with request rates, or dynamically classify resources based on utilization, PromQL gets awkward.

PromQL excels at time-series aggregation but struggles with complex JOINs and conditional logic, and its support for aggregation and window functions is limited. Those are exactly the areas where SQL shines: combine the two and complex analysis becomes straightforward.

GreptimeDB's TQL (Telemetry Query Language) makes this possible: embed PromQL directly in SQL, combine queries with CTEs, and perform true hybrid analysis.

TQL: PromQL Inside SQL

TQL is GreptimeDB's SQL extension for executing PromQL. The basic syntax:

sql
TQL EVAL (start, end, step) promql_expression

start and end support rich time expressions:

  • now() — current time
  • now() - interval '1' hour — 1 hour ago
  • date_trunc('day', now()) — midnight today
  • '2024-01-01T00:00:00Z' — RFC3339 format
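
For instance, date_trunc makes "since midnight today" queries straightforward. A minimal sketch, reusing the CPU metric from the examples below:

sql
TQL EVAL (date_trunc('day', now()), now(), '5m')
    sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))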

Query CPU usage over the past hour:

sql
TQL EVAL (now() - interval '1' hour, now(), '5m')
    sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m]))

Use AS to alias result columns for easier SQL processing:

sql
TQL EVAL (now() - interval '1' hour, now(), '5m')
    sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m])) AS cpu_cores

Now the column is named cpu_cores—much easier to reference in SQL.

Execution Environment

TQL is a SQL extension, so it works anywhere you can run SQL against GreptimeDB:

  • CLI: MySQL (mysql -h host -P 4002) or PostgreSQL (psql -h host -p 4003)
  • Programming languages: JDBC, Python DB-API, Go database/sql
  • Visualization: Grafana (MySQL/PostgreSQL data source), DBeaver, DataGrip
  • HTTP API: /v1/sql endpoint

For K8s monitoring, the common setup is configuring a MySQL data source in Grafana pointing to GreptimeDB (port 4002), then writing TQL in dashboard query editors. You get PromQL's time-series power plus SQL's analytical capabilities, with results visualized directly.

Need to get K8s metrics into GreptimeDB first? See Kubernetes Cluster Metrics Monitoring for how to set up Prometheus Remote Write.

CTE + TQL: The Core Pattern

TQL's real power comes from combining with CTEs (Common Table Expressions). A CTE is a temporary result set referenced in the main query—useful for breaking complex queries into readable chunks.

Basic pattern:

sql
WITH metrics_cte AS (
    TQL EVAL (start, end, step) promql_expression AS value_alias
)
SELECT * FROM metrics_cte WHERE value_alias > threshold;

Key points:

  1. Time column depends on data source: Prometheus Remote Write uses greptime_timestamp; custom tables might use ts
  2. Use AS aliasing: Explicit names are more reliable than auto-generated ones
  3. Only TQL EVAL works: Can't use TQL EXPLAIN or TQL ANALYZE in CTEs

As an example, let's find pods using more than 0.5 CPU cores, sorted by usage.

Note that container_cpu_usage_seconds_total is a Prometheus counter: it records cumulative CPU seconds, so you need rate() to convert it to per-second usage. Computing counter rates (and handling counter resets) in pure SQL is awkward, while pure PromQL's filtering and sorting isn't flexible enough. Hybrid queries cover both:

sql
WITH cpu_usage AS (
    TQL EVAL (now() - interval '1' hour, now(), '5m')
    sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
    AS cpu_cores
)
SELECT
    greptime_timestamp as ts,
    namespace,
    pod,
    cpu_cores
FROM cpu_usage
WHERE cpu_cores > 0.5
ORDER BY cpu_cores DESC
LIMIT 20;

PromQL calculates CPU usage, SQL handles filtering and sorting. Pure PromQL can do this too, but SQL syntax is more intuitive—especially as conditions get complex.

Use Case 1: Tiered Resource Alerts

Set different alert levels based on resource usage. Pure PromQL needs multiple rules; SQL's CASE WHEN handles it in one query:

sql
WITH cpu_usage AS (
    TQL EVAL (now() - interval '1' hour, now(), '5m')
    sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
    AS cpu_cores
)
SELECT
    greptime_timestamp as ts,
    namespace,
    pod,
    ROUND(cpu_cores, 3) as cpu_cores,
    CASE
        WHEN cpu_cores > 2 THEN 'critical'
        WHEN cpu_cores > 1 THEN 'warning'
        ELSE 'normal'
    END as alert_level
FROM cpu_usage
WHERE cpu_cores > 0.1
ORDER BY cpu_cores DESC;

One query, tiered classification, critical issues at the top.
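
If you only need the distribution rather than the per-pod rows, the same classification can feed a GROUP BY. A minimal sketch (the CASE expression is repeated in GROUP BY for portability; note that a pod can land in different tiers at different timestamps within the window):

sql
WITH cpu_usage AS (
    TQL EVAL (now() - interval '1' hour, now(), '5m')
    sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
    AS cpu_cores
)
SELECT
    CASE
        WHEN cpu_cores > 2 THEN 'critical'
        WHEN cpu_cores > 1 THEN 'warning'
        ELSE 'normal'
    END as alert_level,
    -- count unique pods per tier over the window
    COUNT(DISTINCT pod) as pod_count
FROM cpu_usage
GROUP BY
    CASE
        WHEN cpu_cores > 2 THEN 'critical'
        WHEN cpu_cores > 1 THEN 'warning'
        ELSE 'normal'
    END
ORDER BY pod_count DESC;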

Use Case 2: Cross-Metric Correlation

How much CPU does each service consume per request? Useful for code efficiency evaluation or capacity planning.

Correlate CPU with request volume. In pure PromQL this is convoluted; with a CTE + JOIN it's natural. The example uses http_requests_total; your cluster may expose a different metric name, so adjust accordingly:

sql
WITH
    cpu_data AS (
        TQL EVAL (now() - interval '1' hour, now(), '5m')
        sum by (pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
        AS cpu_cores
    ),
    request_data AS (
        TQL EVAL (now() - interval '1' hour, now(), '5m')
        sum by (pod) (rate(http_requests_total[5m]))
        AS req_per_sec
    )
SELECT
    c.greptime_timestamp as ts,
    c.pod,
    ROUND(c.cpu_cores, 3) as cpu_cores,
    ROUND(r.req_per_sec, 1) as req_per_sec,
    ROUND(c.cpu_cores / r.req_per_sec * 1000, 2) as cpu_per_1k_requests
FROM cpu_data c
JOIN request_data r
    ON c.greptime_timestamp = r.greptime_timestamp
    AND c.pod = r.pod
WHERE r.req_per_sec > 1
ORDER BY cpu_per_1k_requests DESC;

CPU per 1,000 requests. Higher values = resource-hungry services that may need optimization.

Tip: JOIN conditions need both timestamp and label matching (here, pod). Ensure both TQL queries have consistent by (...) dimensions.

Use Case 3: API Response Time P99 Analysis

SLO monitoring is standard practice. Let's find the endpoints with the highest P99 latency: PromQL's histogram_quantile computes the percentiles, and SQL makes sorting and filtering easier.

Using http_request_duration_seconds_bucket here—the Prometheus community standard. If you're using Istio, Nginx Ingress, etc., the metric name may differ.

sql
WITH latency_p99 AS (
    TQL EVAL (now() - interval '1' hour, now(), '5m')
    histogram_quantile(0.99, sum by (service, endpoint, le) (rate(http_request_duration_seconds_bucket[5m])))
    AS p99_seconds
)
SELECT
    greptime_timestamp as ts,
    service,
    endpoint,
    ROUND(p99_seconds * 1000, 1) as p99_ms,
    CASE
        WHEN p99_seconds > 1 THEN 'SLO violated'
        WHEN p99_seconds > 0.5 THEN 'at risk'
        ELSE 'healthy'
    END as slo_status
FROM latency_p99
WHERE p99_seconds IS NOT NULL
ORDER BY p99_seconds DESC
LIMIT 20;

Top 20 endpoints by P99 latency, with SLO status. Over 1 second = SLO violated, quick triage.

Use Case 4: Pod Restart Anomaly Detection

Frequent restarts are a red flag. TQL counts restarts, SQL aggregates for anomaly detection:

sql
WITH restart_counts AS (
    TQL EVAL (now() - interval '6' hour, now(), '30m')
    sum by (namespace, pod) (increase(kube_pod_container_status_restarts_total[30m]))
    AS restarts
)
SELECT
    namespace,
    pod,
    CAST(SUM(restarts) AS INT) as total_restarts,
    CAST(MAX(restarts) AS INT) as max_per_30m,
    ROUND(AVG(restarts), 2) as avg_per_30m
FROM restart_counts
WHERE restarts > 0
GROUP BY namespace, pod
HAVING SUM(restarts) >= 3
ORDER BY total_restarts DESC;

HAVING filters sporadic restarts, focusing on pods with real problems.

Best Practices

1. Use AS Aliasing

Auto-generated PromQL column names can be long and may change between versions. Explicit naming makes queries robust.

2. Check Time Column Names in JOINs

As mentioned earlier, Prometheus Remote Write uses greptime_timestamp; custom tables may differ. Don't get it wrong in JOINs.
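
A hypothetical sketch: joining Remote Write metrics (time column greptime_timestamp) against a custom table, here called app_latency, whose time column is ts. The table name and schema are made up for illustration:

sql
WITH cpu_usage AS (
    TQL EVAL (now() - interval '1' hour, now(), '5m')
    sum by (pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
    AS cpu_cores
)
SELECT
    c.greptime_timestamp as ts,
    c.pod,
    ROUND(c.cpu_cores, 3) as cpu_cores,
    a.p99_ms
FROM cpu_usage c
JOIN app_latency a
    -- greptime_timestamp on the TQL side, ts on the custom table side;
    -- if sampling intervals differ, align timestamps first (e.g. with date_trunc)
    ON c.greptime_timestamp = a.ts
    AND c.pod = a.pod
WHERE c.cpu_cores > 0.5;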

3. Keep Aggregation Dimensions Consistent

When JOINing TQL CTEs, ensure by (...) dimensions match. Otherwise, JOINs produce Cartesian products or empty results.

4. Set Query Ranges Appropriately

The (start, end, step) parameters determine data volume and query cost. Use short ranges and small steps for real-time views, and long ranges with larger steps for trend analysis.

5. Leverage GreptimeDB's SQL

Built on Apache DataFusion, GreptimeDB provides rich functions:

  • Aggregate: SUM, AVG, MIN, MAX, COUNT
  • Window: ROW_NUMBER, RANK, LAG, LEAD
  • Math: ROUND, ABS, FLOOR, CEIL, CLAMP
  • Time: date_trunc, date_add, date_format, to_unixtime
  • Conditional: CASE WHEN, COALESCE, NULLIF

Full list: GreptimeDB SQL Functions.
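
As a sketch of how these compose with TQL: rank pods by their hourly average CPU within each namespace and keep the top three, combining AVG with the ROW_NUMBER window function over the TQL result:

sql
WITH cpu_usage AS (
    TQL EVAL (now() - interval '1' hour, now(), '5m')
    sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
    AS cpu_cores
)
SELECT namespace, pod, avg_cpu
FROM (
    SELECT
        namespace,
        pod,
        ROUND(AVG(cpu_cores), 3) as avg_cpu,
        -- rank pods within each namespace by their average CPU over the window
        ROW_NUMBER() OVER (PARTITION BY namespace ORDER BY AVG(cpu_cores) DESC) as rn
    FROM cpu_usage
    GROUP BY namespace, pod
) ranked
WHERE rn <= 3
ORDER BY namespace, avg_cpu DESC;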

Wrapping Up

PromQL excels at time-series aggregation; SQL excels at correlation and conditional processing. With TQL + CTEs:

  1. PromQL handles time-series aggregation (rate, histogram_quantile, increase)
  2. SQL handles filtering, JOINs, conditionals, window analysis
  3. Complex analysis breaks into readable chunks

For K8s monitoring, analyses that used to require application-layer post-processing can now happen directly in the database.

Next time you face a complex monitoring task, try hybrid queries.
