Monitor the EDOT Collector with internal metrics
The EDOT Collector exposes internal OpenTelemetry metrics that provide visibility into its health, performance, and telemetry pipeline behavior. Monitoring these metrics can help you proactively detect backpressure, exporter failures, dropped spans, and resource saturation before they impact data ingestion.
The EDOT Collector exposes internal metrics in Prometheus format by default at http://127.0.0.1:8888/metrics. To expose metrics on all interfaces or customize the endpoint, update the service.telemetry.metrics section in your Collector configuration.
service:
  telemetry:
    metrics:
      readers:
        - pull:
            exporter:
              prometheus:
                host: '0.0.0.0'
                port: 8888
This configuration serves metrics on port 8888 and makes them available to scrape from any network interface.
The exact configuration might vary based on deployment mode and whether metrics are scraped directly or forwarded by another collector or Elastic Agent.
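As a sketch of further customization, assuming you want a localhost-only binding, a different port, and more granular internal metrics, the same section also accepts a level setting. The port and level values below are illustrative, not recommendations from this guide:

service:
  telemetry:
    metrics:
      level: detailed          # optional: controls metric granularity (none, basic, normal, detailed)
      readers:
        - pull:
            exporter:
              prometheus:
                host: '127.0.0.1'   # bind to localhost if metrics are scraped from the same host
                port: 8889          # illustrative alternative port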
To collect internal metrics, use the EDOT Collector's Prometheus receiver (prometheusreceiver) to scrape the Prometheus endpoint exposed by the Collector. Unlike the Metricbeat-style prometheus/metrics input, this OTLP-native receiver from the contrib distribution doesn't add ECS fields as metadata.
When running the Collector (including under Elastic Agent), add a Prometheus receiver and a metrics pipeline that scrapes the internal metrics endpoint. For example:
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otelcol-internal'
          static_configs:
            - targets: ['127.0.0.1:8888']
          metrics_path: /metrics

service:
  pipelines:
    metrics/internal:
      receivers:
        - prometheus
      exporters:
        - otlp
Replace 127.0.0.1:8888 with <collector-host>:8888 if scraping from another host. After ingestion, these metrics are available in Elastic Observability for dashboards, visualizations, and alerting.
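The pipeline above references an otlp exporter that must be defined in the same configuration. The following sketch is a hypothetical example; the endpoint and Authorization header are placeholders for whatever OTLP destination and credentials you actually use:

exporters:
  otlp:
    # Placeholder endpoint; replace with the OTLP endpoint you send telemetry to.
    endpoint: https://otlp.example.com:443
    headers:
      # Placeholder credential; use the authentication scheme your backend expects.
      Authorization: "ApiKey <your-api-key>"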
The EDOT Collector emits internal metrics under the otelcol.* namespace (refer to the Collector service metadata for more information). When you scrape the Prometheus endpoint, metric names are normalized to Prometheus conventions and appear with the otelcol_ prefix, with dots replaced by underscores: for example, otelcol.exporter.sent_spans becomes otelcol_exporter_sent_spans. Use these metrics to monitor the Collector's internal state and surface operational issues.
Monitor telemetry flow across pipeline stages:
- otelcol_receiver_accepted_spans
- otelcol_receiver_refused_spans
- otelcol_receiver_failed_spans
- otelcol_exporter_sent_spans
- otelcol_exporter_send_failed_spans
Look for gaps between accepted and sent spans to identify delays or failures.
Monitor queue pressure between processors and exporters:
- otelcol_exporter_queue_size
- otelcol_exporter_queue_capacity
- otelcol_exporter_enqueue_failed_spans
Rising queue sizes or enqueue failures might signal backpressure or telemetry loss.
Track send failures and retry behavior:
- otelcol_exporter_send_failed_spans
- otelcol_exporter_send_failed_metric_points
- otelcol_exporter_send_failed_log_records
High failure counts might result from network errors, invalid credentials, or backend throttling. Exporters might retry failed sends automatically, so these metrics don't always indicate data loss.
Monitor the Collector's resource utilization:
- otelcol_process_memory_rss
- otelcol_process_cpu_seconds
- otelcol_runtime_num_goroutines
High or growing values can indicate memory leaks, inefficient configuration, or excessive load.
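When these values climb, a common safeguard is the memory_limiter processor, which applies backpressure and refuses data before the process runs out of memory. This is a minimal sketch with illustrative limits, not values from this guide; size them to your deployment:

processors:
  memory_limiter:
    check_interval: 1s     # how often memory usage is checked
    limit_mib: 800         # soft memory ceiling for the Collector process
    spike_limit_mib: 200   # headroom reserved for short spikes

The memory_limiter is typically placed first in each pipeline's processors list so it can shed load before other components do work.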
The following patterns help identify and resolve common Collector performance issues.
Backpressure and queue buildup
Symptoms:
- Queue size increases over time
- Enqueue failures or dropped spans
Causes:
- Backend slowness or outages
- Exporter throughput limits
- Insufficient Collector resources
Resolution:
- Check exporter health and credentials
- Tune queue and batch settings (see the example below)
- Scale the Collector instance or deployment
For more information, refer to Export failures when sending telemetry data (sending_queue overflow, exporter timeouts) and 429 errors when using the mOTLP endpoint (rate limiting and backpressure).
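As a starting point for the queue and batch tuning mentioned in the resolution steps, the following sketch shows the relevant batch processor and exporter sending_queue settings. The values are illustrative assumptions; tune them against your own throughput and backend limits:

processors:
  batch:
    send_batch_size: 8192   # larger batches reduce per-request overhead
    timeout: 5s             # flush a partial batch after this long

exporters:
  otlp:
    sending_queue:
      enabled: true
      num_consumers: 10     # parallel senders draining the queue
      queue_size: 5000      # requests that can wait before enqueue failures occur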
Export and send failures
Symptoms:
- Elevated *_send_failed_* metrics
- Growing retry queues
Causes:
- Network issues or timeouts
- Backend rate limiting
- Misconfigured authentication
Resolution:
- Verify backend availability and credentials
- Review ingestion limits and retry logic (see the example below)
- Investigate latency or firewall constraints
For more information, refer to Export failures when sending telemetry data (export failures, retries), 429 errors when using the mOTLP endpoint (rate limiting), and Connectivity issues with EDOT (network, authorization, firewall).
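When failures are transient, the exporter's retry and timeout settings determine how long the Collector keeps retrying before data is dropped. A hedged sketch with illustrative values:

exporters:
  otlp:
    timeout: 30s               # per-request timeout
    retry_on_failure:
      enabled: true
      initial_interval: 5s     # delay before the first retry
      max_interval: 30s        # cap on the exponential backoff
      max_elapsed_time: 300s   # give up (and drop the data) after this long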
Resource exhaustion
Symptoms:
- Rising memory RSS
- Sustained high CPU usage
- Increasing goroutine count
Causes:
- High-volume telemetry ingestion
- Inefficient processor configurations
- Memory leaks in custom components
Resolution:
- Adjust sampling or processing logic
- Increase resource limits (see the example below)
- Horizontally scale Collector instances
For more information, refer to Collector out of memory (OOM errors, memory exhaustion) and Insufficient resources in Kubernetes (resource limits, scaling).
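If the Collector runs in Kubernetes, increasing resource limits means raising the requests and limits on the Collector container. This is a hypothetical excerpt of a Deployment or DaemonSet container spec; the values are assumptions to size from observed usage:

resources:
  requests:
    cpu: 200m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi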
Use internal metrics to build dashboards and alerting rules that track real-time pipeline health and detect regressions early.
Example alert scenarios:
- Exporter queue usage exceeds 80% for more than 5 minutes
- Send failure rate exceeds a defined threshold
- Dropped spans exceed a historical baseline
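If you also evaluate these metrics with a Prometheus server, the first scenario could be expressed as the following hypothetical alerting rule; in Elastic Observability you would build an equivalent rule on the ingested otelcol_* metrics instead:

groups:
  - name: otelcol-internal
    rules:
      - alert: ExporterQueueNearCapacity
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Exporter queue above 80% of capacity for more than 5 minutes"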