🛠 This page is for engineering teams self-hosting their own Lightdash instance. If you want to monitor usage and analytics, go to the Usage analytics guide.
## Enabling Prometheus metrics
By default, Prometheus metrics are disabled in Lightdash. To enable them, set the following environment variable:
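```bash
LIGHTDASH_PROMETHEUS_ENABLED=true
```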
## Configuration options

You can customize the Prometheus metrics endpoint using the following environment variables:

| Variable | Description | Default |
|---|---|---|
| LIGHTDASH_PROMETHEUS_ENABLED | Enables/disables the Prometheus metrics endpoint | false |
| LIGHTDASH_PROMETHEUS_PORT | Port for the Prometheus metrics endpoint | 9090 |
| LIGHTDASH_PROMETHEUS_PATH | Path for the Prometheus metrics endpoint | /metrics |
| LIGHTDASH_PROMETHEUS_PREFIX | Prefix for metric names | |
| LIGHTDASH_GC_DURATION_BUCKETS | Buckets for the GC duration histogram in seconds | 0.001, 0.01, 0.1, 1, 2, 5 |
| LIGHTDASH_EVENT_LOOP_MONITORING_PRECISION | Precision for event loop monitoring in milliseconds; must be greater than zero | 10 |
| LIGHTDASH_PROMETHEUS_LABELS | Labels added to all metrics; must be valid JSON | |
| LIGHTDASH_CUSTOM_METRICS_CONFIG_PATH | Path to a JSON config file for custom event-driven counter metrics | |
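For example, to expose metrics on the default port and path and attach a static label to every metric (the label value below is illustrative):

```bash
LIGHTDASH_PROMETHEUS_ENABLED=true
LIGHTDASH_PROMETHEUS_PORT=9090        # default
LIGHTDASH_PROMETHEUS_PATH=/metrics    # default
LIGHTDASH_PROMETHEUS_LABELS='{"environment":"production"}'  # example label; any valid JSON object works
```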
## Available metrics
Lightdash exposes the following metrics:

### Process metrics
These metrics provide information about the Node.js process running Lightdash:

| Metric | Type | Description |
|---|---|---|
| process_cpu_user_seconds_total | counter | Total user CPU time spent in seconds |
| process_cpu_system_seconds_total | counter | Total system CPU time spent in seconds |
| process_cpu_seconds_total | counter | Total user and system CPU time spent in seconds |
| process_start_time_seconds | gauge | Start time of the process since the Unix epoch in seconds |
| process_resident_memory_bytes | gauge | Resident memory size in bytes |
| process_virtual_memory_bytes | gauge | Virtual memory size in bytes |
| process_heap_bytes | gauge | Process heap size in bytes |
| process_open_fds | gauge | Number of open file descriptors |
| process_max_fds | gauge | Maximum number of open file descriptors |
### Node.js metrics
These metrics provide information about the Node.js runtime:

| Metric | Type | Description |
|---|---|---|
| nodejs_eventloop_lag_seconds | gauge | Lag of the event loop in seconds |
| nodejs_eventloop_lag_min_seconds | gauge | The minimum recorded event loop delay |
| nodejs_eventloop_lag_max_seconds | gauge | The maximum recorded event loop delay |
| nodejs_eventloop_lag_mean_seconds | gauge | The mean of the recorded event loop delays |
| nodejs_eventloop_lag_stddev_seconds | gauge | The standard deviation of the recorded event loop delays |
| nodejs_eventloop_lag_p50_seconds | gauge | The 50th percentile of the recorded event loop delays |
| nodejs_eventloop_lag_p90_seconds | gauge | The 90th percentile of the recorded event loop delays |
| nodejs_eventloop_lag_p99_seconds | gauge | The 99th percentile of the recorded event loop delays |
| nodejs_active_resources | gauge | Number of active resources currently keeping the event loop alive, grouped by async resource type |
| nodejs_active_resources_total | gauge | Total number of active resources |
| nodejs_active_handles | gauge | Number of active libuv handles grouped by handle type |
| nodejs_active_handles_total | gauge | Total number of active handles |
| nodejs_active_requests | gauge | Number of active libuv requests grouped by request type |
| nodejs_active_requests_total | gauge | Total number of active requests |
| nodejs_heap_size_total_bytes | gauge | Process heap size from Node.js in bytes |
| nodejs_heap_size_used_bytes | gauge | Process heap size used from Node.js in bytes |
| nodejs_external_memory_bytes | gauge | Node.js external memory size in bytes |
| nodejs_heap_space_size_total_bytes | gauge | Total process heap space size from Node.js in bytes |
| nodejs_heap_space_size_used_bytes | gauge | Process heap space size used from Node.js in bytes |
| nodejs_heap_space_size_available_bytes | gauge | Process heap space size available from Node.js in bytes |
| nodejs_version_info | gauge | Node.js version info |
| nodejs_gc_duration_seconds | histogram | Garbage collection duration by kind |
| nodejs_eventloop_utilization | gauge | The calculated Event Loop Utilization (ELU) as a percentage |
### PostgreSQL metrics
These metrics provide information about the PostgreSQL connection pool:

| Metric | Type | Description | Labels |
|---|---|---|---|
| pg_pool_max_size | gauge | Max size of the PG pool | |
| pg_pool_size | gauge | Current size of the PG pool | |
| pg_active_connections | gauge | Number of active connections in the PG pool | |
| pg_idle_connections | gauge | Number of idle connections in the PG pool | |
| pg_queued_queries | gauge | Number of queries waiting in the PG pool queue | |
| pg_connection_acquire_time | histogram | Time to acquire a connection from the PG pool in milliseconds | |
| pg_query_duration | histogram | PG query execution time in milliseconds | |
### Queue metrics
| Metric | Type | Description |
|---|---|---|
| queue_size | gauge | Number of jobs in the queue |
### Query metrics
These metrics track query execution performance. The `context` label is either `scheduled` or `interactive`, depending on the execution context.
| Metric | Type | Description | Labels |
|---|---|---|---|
| lightdash_query_status_total | counter | Total number of queries by terminal status | status, context |
| lightdash_query_state_transitions_total | counter | Query state transitions | from, to, context |
| lightdash_query_queue_wait_duration_seconds | histogram | Time spent waiting in the queue before execution | context |
| lightdash_query_total_duration_seconds | histogram | Total query duration from creation to results ready | context |
| lightdash_query_warehouse_duration_seconds | histogram | Warehouse query execution duration | warehouse_type, context |
| lightdash_query_overhead_duration_seconds | histogram | Lightdash overhead: total duration minus warehouse execution time | context |
| lightdash_query_cache_hit_total | counter | Total number of query cache hits and misses | result, context, has_pre_aggregate_match |
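As a sketch of how these histograms can be used (assuming no `LIGHTDASH_PROMETHEUS_PREFIX` is set), the following PromQL query estimates the 95th-percentile queue wait per execution context:

```promql
histogram_quantile(
  0.95,
  sum by (le, context) (rate(lightdash_query_queue_wait_duration_seconds_bucket[5m]))
)
```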
### Pre-aggregate metrics
These metrics track the pre-aggregate system, including materialization, DuckDB resolution, and file management:

| Metric | Type | Description | Labels |
|---|---|---|---|
| lightdash_pre_aggregate_match_total | counter | Total number of pre-aggregate match attempts | result, miss_reason, format |
| lightdash_pre_aggregate_materialization_total | counter | Total number of pre-aggregate materializations by outcome | status, trigger |
| lightdash_pre_aggregate_active_materializations | gauge | Current number of active pre-aggregate materializations | |
| lightdash_pre_aggregate_materialization_duration_seconds | histogram | Pre-aggregate materialization duration | status, trigger |
| lightdash_pre_aggregate_materialization_poll_duration_seconds | histogram | Time spent polling for materialization query completion in seconds | status, trigger |
| lightdash_pre_aggregate_materialization_warehouse_duration_seconds | histogram | Warehouse execution time during materialization in seconds | status, trigger |
| lightdash_pre_aggregate_materialization_promote_duration_seconds | histogram | Time to check file size and promote a materialization to active in seconds | status, trigger |
| lightdash_pre_aggregate_materialization_file_size_bytes | histogram | File size of pre-aggregate materialization in bytes | format |
| lightdash_pre_aggregate_parquet_conversion_duration_seconds | histogram | Duration of JSONL-to-Parquet conversion | status |
| lightdash_pre_aggregate_duckdb_resolution_total | counter | Total number of DuckDB pre-aggregate resolution attempts | status, reason |
| lightdash_pre_aggregate_duckdb_resolution_duration_seconds | histogram | DuckDB pre-aggregate resolution duration | status |
| lightdash_pre_aggregate_duckdb_query_latency_seconds | histogram | Total DuckDB query latency in seconds | |
| lightdash_pre_aggregate_duckdb_parquet_read_duration_seconds | histogram | Time spent in READ_PARQUET operators in seconds | |
| lightdash_pre_aggregate_duckdb_bytes_read | histogram | Bytes read from S3/Parquet by DuckDB queries | |
| lightdash_pre_aggregate_duckdb_scan_amplification | histogram | Ratio of rows scanned to rows returned in DuckDB queries | |
| lightdash_pre_aggregate_fallback_total | counter | Total number of opportunistic pre-aggregate fallbacks to the warehouse | reason |
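For example, a rough pre-aggregate hit rate can be derived from the match counter. Note that the `result="hit"` label value below is an assumption; check the label values your instance actually emits:

```promql
sum(rate(lightdash_pre_aggregate_match_total{result="hit"}[5m]))
/
sum(rate(lightdash_pre_aggregate_match_total[5m]))
```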
### AI agent metrics
These metrics track the performance of the AI agent:

| Metric | Type | Description | Labels |
|---|---|---|---|
| ai_agent_generate_response_duration_ms | histogram | AI agent generate-response time in milliseconds | |
| ai_agent_stream_response_duration_ms | histogram | AI agent stream-response time in milliseconds | |
| ai_agent_stream_first_chunk_ms | histogram | AI agent time to first chunk (any type) | |
| ai_agent_ttft_ms | histogram | AI agent time to first token (TTFT) | model, mode |
### S3 metrics
| Metric | Type | Description | Labels |
|---|---|---|---|
| lightdash_s3_results_upload_duration_seconds | histogram | S3 results upload duration | source |
### Custom event metrics
Lightdash supports operator-configurable Prometheus counter metrics that are driven by application events. These are defined via a JSON configuration file specified by the `LIGHTDASH_CUSTOM_METRICS_CONFIG_PATH` environment variable.
Each entry in the config file creates a counter metric that increments when a matching application event fires. This allows you to track custom business-level metrics such as user logins or query executions without modifying the application code.
## Using metrics for monitoring and alerting
You can use these metrics to create dashboards and alerts in your monitoring system. Some common use cases include:

- Monitoring memory usage and setting alerts for potential memory leaks
- Tracking PostgreSQL connection pool utilization
- Monitoring event loop lag to detect performance issues
- Setting up alerts for high CPU usage

For example, you might set alert conditions such as:

- High memory usage: `process_resident_memory_bytes > threshold`
- Event loop lag: `nodejs_eventloop_lag_p99_seconds > threshold`
- Database connection pool saturation: `pg_active_connections / pg_pool_max_size > 0.8`
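Here is a minimal sketch of Prometheus alerting rules for these conditions; the thresholds and durations are illustrative and should be tuned to your instance:

```yaml
groups:
  - name: lightdash
    rules:
      - alert: LightdashHighMemoryUsage
        expr: process_resident_memory_bytes > 2e9  # ~2 GB, illustrative threshold
        for: 10m
      - alert: LightdashEventLoopLag
        expr: nodejs_eventloop_lag_p99_seconds > 0.5  # illustrative threshold
        for: 5m
      - alert: LightdashPgPoolSaturation
        expr: pg_active_connections / pg_pool_max_size > 0.8
        for: 5m
```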
## OpenTelemetry support
Lightdash metrics are also compatible with OpenTelemetry. You can use the OpenTelemetry Collector with the Prometheus receiver to scrape Lightdash’s Prometheus metrics endpoint and export them to any OpenTelemetry-compatible backend. Here is an example OpenTelemetry Collector configuration:
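This is a minimal sketch, assuming the default Lightdash port and path; the target host and OTLP backend endpoint are placeholders:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: lightdash
          scrape_interval: 30s
          static_configs:
            - targets: ['lightdash:9090']  # placeholder host; LIGHTDASH_PROMETHEUS_PORT default

exporters:
  otlp:
    endpoint: otel-backend.example.com:4317  # placeholder OTLP backend address

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
```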
## Setting up a Prometheus server

If you don’t already have a Prometheus server set up, here are some resources to help you get started:

### General Prometheus setup
- Prometheus Getting Started Guide - Official documentation on how to install and configure Prometheus
- Prometheus Installation - Different ways to install Prometheus
- Prometheus Configuration - Detailed configuration options for Prometheus
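Once your server is running, a minimal scrape job for Lightdash looks like the following (the target host is a placeholder; port and path are the Lightdash defaults):

```yaml
scrape_configs:
  - job_name: lightdash
    metrics_path: /metrics  # LIGHTDASH_PROMETHEUS_PATH default
    static_configs:
      - targets: ['lightdash:9090']  # placeholder host; LIGHTDASH_PROMETHEUS_PORT default
```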
### Setting up Prometheus in Google Cloud Platform (GCP)
- Google Cloud Managed Service for Prometheus - Google Cloud’s managed Prometheus service
- Installing Prometheus on GKE - Setting up Prometheus on Google Kubernetes Engine
- Google Cloud Operations Suite Integration - Integrating Prometheus with Google Cloud Operations Suite
### Setting up Prometheus in Amazon Web Services (AWS)
- Amazon Managed Service for Prometheus - AWS managed Prometheus service
- Getting Started with Amazon Managed Service for Prometheus - Official AWS documentation
- Setting up Prometheus on Amazon EKS - Deploying Prometheus on Amazon Elastic Kubernetes Service