Kubernetes monitoring is table stakes. The Prometheus + Grafana stack has been around for years, and plenty of engineers can write PromQL in their sleep. But when you try to correlate CPU usage with request rates, or dynamically classify resources based on utilization, PromQL gets awkward.
PromQL excels at time-series aggregation, but it struggles with complex JOINs and conditional logic, and its support for general-purpose window functions is limited—exactly where SQL shines. Combine the two and complex analysis becomes straightforward.
GreptimeDB's TQL (Telemetry Query Language) makes this possible: embed PromQL directly in SQL, combine queries with CTEs, and perform true hybrid analysis.
TQL: PromQL Inside SQL
TQL is GreptimeDB's SQL extension for executing PromQL. The basic syntax:
TQL EVAL (start, end, step) promql_expression
start and end support rich time expressions:
- now() — current time
- now() - interval '1' hour — 1 hour ago
- date_trunc('day', now()) — midnight today
- '2024-01-01T00:00:00Z' — RFC3339 format
Query CPU usage over the past hour:
TQL EVAL (now() - interval '1' hour, now(), '5m')
sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m]))
Use AS to alias result columns for easier SQL processing:
TQL EVAL (now() - interval '1' hour, now(), '5m')
sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m])) AS cpu_cores
Now the column is named cpu_cores—much easier to reference in SQL.
Execution Environment
TQL is a SQL extension, so anywhere you run SQL against GreptimeDB works:
- CLI: MySQL (mysql -h host -P 4002) or PostgreSQL (psql -h host -p 4003)
- Programming languages: JDBC, Python DB-API, Go database/sql
- Visualization: Grafana (MySQL/PostgreSQL data source), DBeaver, DataGrip
- HTTP API: /v1/sql endpoint
For K8s monitoring, the common setup is configuring a MySQL data source in Grafana pointing to GreptimeDB (port 4002), then writing TQL in dashboard query editors. You get PromQL's time-series power plus SQL's analytical capabilities, with results visualized directly.
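As a concrete example, a panel query pasted into Grafana's SQL editor might look like the sketch below. It hard-codes a relative window with now(); whether your Grafana version's MySQL time macros (such as $__timeFrom() and $__timeTo()) can be substituted into TQL EVAL is an assumption to verify against your setup.

-- Pod CPU over the last 3 hours, for a Grafana time-series panel
TQL EVAL (now() - interval '3' hour, now(), '1m')
sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m])) AS cpu_cores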
How to ingest K8s metrics? See Kubernetes Cluster Metrics Monitoring to learn how to use Prometheus Remote Write.
CTE + TQL: The Core Pattern
TQL's real power comes from combining with CTEs (Common Table Expressions). A CTE is a temporary result set referenced in the main query—useful for breaking complex queries into readable chunks.
Basic pattern:
WITH metrics_cte AS (
TQL EVAL (start, end, step) promql_expression AS value_alias
)
SELECT * FROM metrics_cte WHERE value_alias > threshold;
Key points:
- Time column depends on data source: Prometheus Remote Write uses greptime_timestamp; custom tables might use ts (see the schema check below)
- Use AS aliasing: Explicit names are more reliable than auto-generated ones
- Only TQL EVAL works: Can't use TQL EXPLAIN or TQL ANALYZE in CTEs
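One quick way to confirm the timestamp column name (assuming the metric landed via Remote Write as a table of the same name):

-- Inspect the table schema to find the time index column
DESC TABLE container_cpu_usage_seconds_total;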
A first example: find pods using more than 0.5 CPU cores, sorted by usage.
Note that container_cpu_usage_seconds_total is a Prometheus counter—it records cumulative CPU seconds. You need rate() to convert it to per-second usage. Pure SQL has no counter-aware rate(), so handling counter resets yourself is painful; pure PromQL's filtering and sorting isn't flexible enough—hybrid queries solve both:
WITH cpu_usage AS (
TQL EVAL (now() - interval '1' hour, now(), '5m')
sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
AS cpu_cores
)
SELECT
greptime_timestamp as ts,
namespace,
pod,
cpu_cores
FROM cpu_usage
WHERE cpu_cores > 0.5
ORDER BY cpu_cores DESC
LIMIT 20;
PromQL calculates CPU usage, SQL handles filtering and sorting. Pure PromQL can do this too, but SQL syntax is more intuitive—especially as conditions get complex.
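For comparison, roughly the same thing in pure PromQL relies on topk plus a value filter (the semantics differ slightly: topk ranks per evaluation step, while the SQL version ranks across the whole result set):

TQL EVAL (now() - interval '1' hour, now(), '5m')
topk(20, sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m])) > 0.5)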
Use Case 1: Tiered Resource Alerts
Set different alert levels based on resource usage. Pure PromQL needs multiple rules; SQL's CASE WHEN handles it in one query:
WITH cpu_usage AS (
TQL EVAL (now() - interval '1' hour, now(), '5m')
sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
AS cpu_cores
)
SELECT
greptime_timestamp as ts,
namespace,
pod,
ROUND(cpu_cores, 3) as cpu_cores,
CASE
WHEN cpu_cores > 2 THEN 'critical'
WHEN cpu_cores > 1 THEN 'warning'
ELSE 'normal'
END as alert_level
FROM cpu_usage
WHERE cpu_cores > 0.1
ORDER BY cpu_cores DESC;
One query, tiered classification, critical issues at the top.
Use Case 2: Cross-Metric Correlation
How much CPU does each service consume per request? Useful for code efficiency evaluation or capacity planning.
Correlate CPU with request volume. Pure PromQL is convoluted; CTE + JOIN is natural (using http_requests_total here—your cluster may use a different metric name, adjust accordingly):
WITH
cpu_data AS (
TQL EVAL (now() - interval '1' hour, now(), '5m')
sum by (pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
AS cpu_cores
),
request_data AS (
TQL EVAL (now() - interval '1' hour, now(), '5m')
sum by (pod) (rate(http_requests_total[5m]))
AS req_per_sec
)
SELECT
c.greptime_timestamp as ts,
c.pod,
ROUND(c.cpu_cores, 3) as cpu_cores,
ROUND(r.req_per_sec, 1) as req_per_sec,
ROUND(c.cpu_cores / r.req_per_sec * 1000, 2) as cpu_per_1k_requests
FROM cpu_data c
JOIN request_data r
ON c.greptime_timestamp = r.greptime_timestamp
AND c.pod = r.pod
WHERE r.req_per_sec > 1
ORDER BY cpu_per_1k_requests DESC;
CPU per 1,000 requests. Higher values = resource-hungry services that may need optimization.
Tip: JOIN conditions need both timestamp and label matching (here, pod). Ensure both TQL queries have consistent by (...) dimensions.
Use Case 3: API Response Time P99 Analysis
SLO monitoring is standard practice. Find endpoints with highest P99 latency. PromQL's histogram_quantile computes percentiles; SQL makes sorting and filtering easier.
Using http_request_duration_seconds_bucket here—the Prometheus community standard. If you're using Istio, Nginx Ingress, etc., the metric name may differ.
WITH latency_p99 AS (
TQL EVAL (now() - interval '1' hour, now(), '5m')
histogram_quantile(0.99, sum by (service, endpoint, le) (rate(http_request_duration_seconds_bucket[5m])))
AS p99_seconds
)
SELECT
greptime_timestamp as ts,
service,
endpoint,
ROUND(p99_seconds * 1000, 1) as p99_ms,
CASE
WHEN p99_seconds > 1 THEN 'SLO violated'
WHEN p99_seconds > 0.5 THEN 'at risk'
ELSE 'healthy'
END as slo_status
FROM latency_p99
WHERE p99_seconds IS NOT NULL
ORDER BY p99_seconds DESC
LIMIT 20;
Top 20 endpoints by P99 latency, with SLO status: anything over 1 second is flagged as violated, so triage is quick.
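If you only want the most recent evaluation point per endpoint (say, for a current-status table), a window function can pick it out of the TQL result. A sketch reusing the same CTE:

WITH latency_p99 AS (
TQL EVAL (now() - interval '1' hour, now(), '5m')
histogram_quantile(0.99, sum by (service, endpoint, le) (rate(http_request_duration_seconds_bucket[5m])))
AS p99_seconds
),
ranked AS (
SELECT
greptime_timestamp as ts,
service,
endpoint,
p99_seconds,
-- rn = 1 marks the newest sample for each service/endpoint pair
ROW_NUMBER() OVER (PARTITION BY service, endpoint ORDER BY greptime_timestamp DESC) as rn
FROM latency_p99
WHERE p99_seconds IS NOT NULL
)
SELECT ts, service, endpoint, ROUND(p99_seconds * 1000, 1) as p99_ms
FROM ranked
WHERE rn = 1
ORDER BY p99_ms DESC;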
Use Case 4: Pod Restart Anomaly Detection
Frequent restarts are a red flag. TQL counts restarts, SQL aggregates for anomaly detection:
WITH restart_counts AS (
TQL EVAL (now() - interval '6' hour, now(), '30m')
sum by (namespace, pod) (increase(kube_pod_container_status_restarts_total[30m]))
AS restarts
)
SELECT
namespace,
pod,
CAST(SUM(restarts) AS INT) as total_restarts,
CAST(MAX(restarts) AS INT) as max_per_30m,
CAST(AVG(restarts) AS INT) as avg_per_30m
FROM restart_counts
WHERE restarts > 0
GROUP BY namespace, pod
HAVING SUM(restarts) >= 3
ORDER BY total_restarts DESC;
HAVING filters out sporadic restarts, focusing on pods with real problems.
Best Practices
1. Use AS Aliasing
Auto-generated PromQL column names can be long and may change between versions. Explicit naming makes queries robust.
2. Check Time Column Names in JOINs
As mentioned earlier, Prometheus Remote Write uses greptime_timestamp; custom tables may differ. Don't get it wrong in JOINs.
3. Keep Aggregation Dimensions Consistent
When JOINing TQL CTEs, ensure by (...) dimensions match. Otherwise, JOINs produce Cartesian products or empty results.
4. Set Query Ranges Appropriately
(start, end, step) affects data volume and performance. Short ranges/small steps for real-time; long ranges/large steps for trends.
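For example (hypothetical windows; tune them to your retention and dashboard refresh rate):

-- Trend view: 7 days at 1-hour resolution
TQL EVAL (now() - interval '7' day, now(), '1h')
sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m])) AS cpu_cores

-- Real-time view: 15 minutes at 30-second resolution
TQL EVAL (now() - interval '15' minute, now(), '30s')
sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[1m])) AS cpu_cores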
5. Leverage GreptimeDB's SQL
Built on Apache DataFusion, GreptimeDB provides rich functions:
- Aggregate: SUM, AVG, MIN, MAX, COUNT
- Window: ROW_NUMBER, RANK, LAG, LEAD
- Math: ROUND, ABS, FLOOR, CEIL, CLAMP
- Time: date_trunc, date_add, date_format, to_unixtime
- Conditional: CASE WHEN, COALESCE, NULLIF
Full list: GreptimeDB SQL Functions.
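For instance, LAG can turn a TQL result into step-over-step deltas. A sketch showing how much each pod's CPU changed between consecutive 5-minute points:

WITH cpu_usage AS (
TQL EVAL (now() - interval '1' hour, now(), '5m')
sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
AS cpu_cores
)
SELECT
greptime_timestamp as ts,
namespace,
pod,
ROUND(cpu_cores, 3) as cpu_cores,
-- difference from the previous 5-minute point for the same pod
ROUND(cpu_cores - LAG(cpu_cores) OVER (PARTITION BY namespace, pod ORDER BY greptime_timestamp), 3) as cpu_delta
FROM cpu_usage
ORDER BY namespace, pod, ts;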
Wrapping Up
PromQL excels at time-series aggregation; SQL excels at correlation and conditional processing. With TQL + CTEs:
- PromQL handles time-series aggregation (rate, histogram_quantile, increase)
- SQL handles filtering, JOINs, conditionals, window analysis
- Complex analysis breaks into readable chunks
For K8s monitoring, analyses that used to require application-layer processing can now happen in the database.
Next time you face a complex monitoring task, try hybrid queries.


