🛠 This page is for engineering teams self-hosting their own Lightdash instance. If you want to monitor usage and analytics, go to the Usage analytics guide.
## Enabling Prometheus metrics
By default, Prometheus metrics are disabled in Lightdash. To enable them, set the following environment variable:
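```bash
LIGHTDASH_PROMETHEUS_ENABLED=true
```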
## Configuration options

You can customize the Prometheus metrics endpoint using the following environment variables:

| Variable | Description | Default |
|---|---|---|
| LIGHTDASH_PROMETHEUS_ENABLED | Enables/disables the Prometheus metrics endpoint | false |
| LIGHTDASH_PROMETHEUS_PORT | Port for the Prometheus metrics endpoint | 9090 |
| LIGHTDASH_PROMETHEUS_PATH | Path for the Prometheus metrics endpoint | /metrics |
| LIGHTDASH_PROMETHEUS_PREFIX | Prefix for metric names | |
| LIGHTDASH_GC_DURATION_BUCKETS | Buckets for the GC duration histogram in seconds | 0.001, 0.01, 0.1, 1, 2, 5 |
| LIGHTDASH_EVENT_LOOP_MONITORING_PRECISION | Precision for event loop monitoring in milliseconds; must be greater than zero | 10 |
| LIGHTDASH_PROMETHEUS_LABELS | Labels added to all metrics; must be valid JSON | |
| LIGHTDASH_CUSTOM_METRICS_CONFIG_PATH | Path to a JSON config file for custom event-driven counter metrics | |
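For example, to expose metrics on the default port and path and attach a static label to every metric (the label value below is illustrative):

```bash
LIGHTDASH_PROMETHEUS_ENABLED=true
LIGHTDASH_PROMETHEUS_PORT=9090        # default
LIGHTDASH_PROMETHEUS_PATH=/metrics    # default
LIGHTDASH_PROMETHEUS_LABELS='{"environment":"production"}'  # example label; any valid JSON object works
```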
## Available metrics
Lightdash exposes the following metrics:

### Process metrics
These metrics provide information about the Node.js process running Lightdash:

| Metric | Type | Description |
|---|---|---|
| process_cpu_user_seconds_total | counter | Total user CPU time spent in seconds |
| process_cpu_system_seconds_total | counter | Total system CPU time spent in seconds |
| process_cpu_seconds_total | counter | Total user and system CPU time spent in seconds |
| process_start_time_seconds | gauge | Start time of the process since the Unix epoch in seconds |
| process_resident_memory_bytes | gauge | Resident memory size in bytes |
| process_virtual_memory_bytes | gauge | Virtual memory size in bytes |
| process_heap_bytes | gauge | Process heap size in bytes |
| process_open_fds | gauge | Number of open file descriptors |
| process_max_fds | gauge | Maximum number of open file descriptors |
### Node.js metrics
These metrics provide information about the Node.js runtime:

| Metric | Type | Description |
|---|---|---|
| nodejs_eventloop_lag_seconds | gauge | Lag of the event loop in seconds |
| nodejs_eventloop_lag_min_seconds | gauge | The minimum recorded event loop delay |
| nodejs_eventloop_lag_max_seconds | gauge | The maximum recorded event loop delay |
| nodejs_eventloop_lag_mean_seconds | gauge | The mean of the recorded event loop delays |
| nodejs_eventloop_lag_stddev_seconds | gauge | The standard deviation of the recorded event loop delays |
| nodejs_eventloop_lag_p50_seconds | gauge | The 50th percentile of the recorded event loop delays |
| nodejs_eventloop_lag_p90_seconds | gauge | The 90th percentile of the recorded event loop delays |
| nodejs_eventloop_lag_p99_seconds | gauge | The 99th percentile of the recorded event loop delays |
| nodejs_active_resources | gauge | Number of active resources currently keeping the event loop alive, grouped by async resource type |
| nodejs_active_resources_total | gauge | Total number of active resources |
| nodejs_active_handles | gauge | Number of active libuv handles grouped by handle type |
| nodejs_active_handles_total | gauge | Total number of active handles |
| nodejs_active_requests | gauge | Number of active libuv requests grouped by request type |
| nodejs_active_requests_total | gauge | Total number of active requests |
| nodejs_heap_size_total_bytes | gauge | Process heap size from Node.js in bytes |
| nodejs_heap_size_used_bytes | gauge | Process heap size used from Node.js in bytes |
| nodejs_external_memory_bytes | gauge | Node.js external memory size in bytes |
| nodejs_heap_space_size_total_bytes | gauge | Total process heap space size from Node.js in bytes |
| nodejs_heap_space_size_used_bytes | gauge | Process heap space size used from Node.js in bytes |
| nodejs_heap_space_size_available_bytes | gauge | Process heap space size available from Node.js in bytes |
| nodejs_version_info | gauge | Node.js version info |
| nodejs_gc_duration_seconds | histogram | Garbage collection duration by kind |
| nodejs_eventloop_utilization | gauge | The calculated Event Loop Utilization (ELU) as a percentage |
### PostgreSQL metrics
These metrics provide information about the PostgreSQL connection pool:

| Metric | Type | Description | Labels |
|---|---|---|---|
| pg_pool_max_size | gauge | Max size of the PG pool | |
| pg_pool_size | gauge | Current size of the PG pool | |
| pg_active_connections | gauge | Number of active connections in the PG pool | |
| pg_idle_connections | gauge | Number of idle connections in the PG pool | |
| pg_queued_queries | gauge | Number of queries waiting in the PG pool queue | |
| pg_connection_acquire_time | histogram | Time to acquire a connection from the PG pool in milliseconds | |
| pg_query_duration | histogram | PG query execution time in milliseconds | |
### Queue metrics
| Metric | Type | Description |
|---|---|---|
| queue_size | gauge | Number of jobs in the queue |
### Query metrics
These metrics track query execution performance. The `context` label is either `scheduled` or `interactive`, depending on the execution context.
| Metric | Type | Description | Labels |
|---|---|---|---|
| lightdash_query_status_total | counter | Total number of queries by terminal status | status, context |
| lightdash_query_state_transitions_total | counter | Query state transitions | from, to, context |
| lightdash_query_queue_wait_duration_seconds | histogram | Time spent waiting in the queue before execution | context |
| lightdash_query_total_duration_seconds | histogram | Total query duration from creation to results ready | context |
| lightdash_query_warehouse_duration_seconds | histogram | Warehouse query execution duration | warehouse_type, context |
| lightdash_query_overhead_duration_seconds | histogram | Lightdash overhead: total duration minus warehouse execution time | context |
| lightdash_query_cache_hit_total | counter | Total number of query cache hits and misses | result, context, has_pre_aggregate_match |
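As a sketch of how these histograms can be used (assuming no `LIGHTDASH_PROMETHEUS_PREFIX` is set), the following PromQL query estimates the 95th-percentile queue wait per execution context:

```promql
histogram_quantile(
  0.95,
  sum by (le, context) (rate(lightdash_query_queue_wait_duration_seconds_bucket[5m]))
)
```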
### Pre-aggregate metrics
These metrics track the pre-aggregate system, including materialization, DuckDB resolution, and file management:

| Metric | Type | Description | Labels |
|---|---|---|---|
| lightdash_pre_aggregate_match_total | counter | Total number of pre-aggregate match attempts | result, miss_reason, format |
| lightdash_pre_aggregate_materialization_total | counter | Total number of pre-aggregate materializations by outcome | status, trigger |
| lightdash_pre_aggregate_active_materializations | gauge | Current number of active pre-aggregate materializations | |
| lightdash_pre_aggregate_materialization_duration_seconds | histogram | Pre-aggregate materialization duration | status, trigger |
| lightdash_pre_aggregate_materialization_poll_duration_seconds | histogram | Time spent polling for materialization query completion in seconds | status, trigger |
| lightdash_pre_aggregate_materialization_warehouse_duration_seconds | histogram | Warehouse execution time during materialization in seconds | status, trigger |
| lightdash_pre_aggregate_materialization_promote_duration_seconds | histogram | Time to check file size and promote a materialization to active in seconds | status, trigger |
| lightdash_pre_aggregate_materialization_file_size_bytes | histogram | File size of pre-aggregate materialization in bytes | format |
| lightdash_pre_aggregate_parquet_conversion_duration_seconds | histogram | Duration of JSONL-to-Parquet conversion | status |
| lightdash_pre_aggregate_duckdb_resolution_total | counter | Total number of DuckDB pre-aggregate resolution attempts | status, reason |
| lightdash_pre_aggregate_duckdb_resolution_duration_seconds | histogram | DuckDB pre-aggregate resolution duration | status |
| lightdash_pre_aggregate_duckdb_query_latency_seconds | histogram | Total DuckDB query latency in seconds | |
| lightdash_pre_aggregate_duckdb_parquet_read_duration_seconds | histogram | Time spent in READ_PARQUET operators in seconds | |
| lightdash_pre_aggregate_duckdb_bytes_read | histogram | Bytes read from S3/Parquet by DuckDB queries | |
| lightdash_pre_aggregate_duckdb_scan_amplification | histogram | Ratio of rows scanned to rows returned in DuckDB queries | |
| lightdash_pre_aggregate_fallback_total | counter | Total number of opportunistic pre-aggregate fallbacks to the warehouse | reason |
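For example, a rough pre-aggregate hit rate can be derived from the match counter. Note that the `result="hit"` label value below is an assumption; check the label values your instance actually emits:

```promql
sum(rate(lightdash_pre_aggregate_match_total{result="hit"}[5m]))
/
sum(rate(lightdash_pre_aggregate_match_total[5m]))
```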
### AI agent metrics
These metrics track the performance of the AI agent:

| Metric | Type | Description | Labels |
|---|---|---|---|
| ai_agent_generate_response_duration_ms | histogram | AI agent generate-response time in milliseconds | |
| ai_agent_stream_response_duration_ms | histogram | AI agent stream-response time in milliseconds | |
| ai_agent_stream_first_chunk_ms | histogram | AI agent time to first chunk (any type) | |
| ai_agent_ttft_ms | histogram | AI agent time to first token (TTFT) | model, mode |
### S3 metrics
| Metric | Type | Description | Labels |
|---|---|---|---|
| lightdash_s3_results_upload_duration_seconds | histogram | S3 results upload duration | source |
### Custom event metrics
Lightdash supports operator-configurable Prometheus counter metrics that are driven by application events. These are defined via a JSON configuration file specified by the `LIGHTDASH_CUSTOM_METRICS_CONFIG_PATH` environment variable.
Each entry in the config file creates a counter metric that increments when a matching application event fires. This allows you to track custom business-level metrics such as user logins or query executions without modifying the application code.
## Using metrics for monitoring and alerting
You can use these metrics to create dashboards and alerts in your monitoring system. Some common use cases include:

- Monitoring memory usage and setting alerts for potential memory leaks
- Tracking PostgreSQL connection pool utilization
- Monitoring event loop lag to detect performance issues
- Setting up alerts for high CPU usage

For example, you might set alert conditions such as:

- High memory usage: `process_resident_memory_bytes > threshold`
- Event loop lag: `nodejs_eventloop_lag_p99_seconds > threshold`
- Database connection pool saturation: `pg_active_connections / pg_pool_max_size > 0.8`
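Here is a minimal sketch of Prometheus alerting rules for these conditions; the thresholds and durations are illustrative and should be tuned to your instance:

```yaml
groups:
  - name: lightdash
    rules:
      - alert: LightdashHighMemoryUsage
        expr: process_resident_memory_bytes > 2e9  # ~2 GB, illustrative threshold
        for: 10m
      - alert: LightdashEventLoopLag
        expr: nodejs_eventloop_lag_p99_seconds > 0.5  # illustrative threshold
        for: 5m
      - alert: LightdashPgPoolSaturation
        expr: pg_active_connections / pg_pool_max_size > 0.8
        for: 5m
```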
## OpenTelemetry support
Lightdash metrics are also compatible with OpenTelemetry. You can use the OpenTelemetry Collector with the Prometheus receiver to scrape Lightdash’s Prometheus metrics endpoint and export them to any OpenTelemetry-compatible backend. Here is an example OpenTelemetry Collector configuration:
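This is a minimal sketch, assuming the default Lightdash port and path; the target host and OTLP backend endpoint are placeholders:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: lightdash
          scrape_interval: 30s
          static_configs:
            - targets: ['lightdash:9090']  # placeholder host; LIGHTDASH_PROMETHEUS_PORT default

exporters:
  otlp:
    endpoint: otel-backend.example.com:4317  # placeholder OTLP backend address

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
```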
## Setting up a Prometheus server

If you don’t already have a Prometheus server set up, here are some resources to help you get started:

### General Prometheus setup
- Prometheus Getting Started Guide - Official documentation on how to install and configure Prometheus
- Prometheus Installation - Different ways to install Prometheus
- Prometheus Configuration - Detailed configuration options for Prometheus
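Once your server is running, a minimal scrape job for Lightdash looks like the following (the target host is a placeholder; port and path are the Lightdash defaults):

```yaml
scrape_configs:
  - job_name: lightdash
    metrics_path: /metrics  # LIGHTDASH_PROMETHEUS_PATH default
    static_configs:
      - targets: ['lightdash:9090']  # placeholder host; LIGHTDASH_PROMETHEUS_PORT default
```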
### Setting up Prometheus in Google Cloud Platform (GCP)
- Google Cloud Managed Service for Prometheus - Google Cloud’s managed Prometheus service
- Installing Prometheus on GKE - Setting up Prometheus on Google Kubernetes Engine
- Google Cloud Operations Suite Integration - Integrating Prometheus with Google Cloud Operations Suite
### Setting up Prometheus in Amazon Web Services (AWS)
- Amazon Managed Service for Prometheus - AWS managed Prometheus service
- Getting Started with Amazon Managed Service for Prometheus - Official AWS documentation
- Setting up Prometheus on Amazon EKS - Deploying Prometheus on Amazon Elastic Kubernetes Service