Monitor the EDOT Collector with internal metrics

The EDOT Collector exposes internal OpenTelemetry metrics that provide visibility into its health, performance, and telemetry pipeline behavior. Monitoring these metrics can help you proactively detect backpressure, exporter failures, dropped spans, and resource saturation before they impact data ingestion.

The EDOT Collector exposes internal metrics in Prometheus format by default at http://127.0.0.1:8888/metrics. To expose metrics on all interfaces or customize the endpoint, update the service.telemetry.metrics section in your Collector configuration.

service:
  telemetry:
    metrics:
      readers:
        - pull:
            exporter:
              prometheus:
                host: '0.0.0.0'
                port: 8888

This configuration serves metrics on port 8888 and makes them available to scrape from any network interface.

Note

The exact configuration might vary based on deployment mode and whether metrics are scraped directly or forwarded by another collector or Elastic Agent.

To collect internal metrics, use the EDOT Collector's Prometheus receiver (prometheusreceiver) to scrape the Prometheus endpoint exposed by the Collector. Unlike the Metricbeat-style prometheus/metrics input, this contrib receiver is OTLP-native and doesn't add ECS fields as metadata.

When running the Collector (including under Elastic Agent), add a Prometheus receiver and a metrics pipeline that scrapes the internal metrics endpoint. For example:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otelcol-internal'
          static_configs:
            - targets: ['127.0.0.1:8888']
          metrics_path: /metrics

service:
  pipelines:
    metrics/internal:
      receivers:
        - prometheus
      exporters:
        - otlp

Replace 127.0.0.1:8888 with <collector-host>:8888 if scraping from another host. After ingestion, these metrics are available in Elastic Observability for dashboards, visualizations, and alerting.
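
The metrics/internal pipeline above references an otlp exporter that must be defined elsewhere in the same configuration. The following is a minimal sketch, assuming you forward the metrics to an OTLP/gRPC endpoint; the endpoint and Authorization value are placeholders for your own destination and credentials:

exporters:
  otlp:
    # Placeholder values: replace with your own OTLP destination and credentials.
    endpoint: <your-otlp-endpoint>:4317
    headers:
      Authorization: <your-credentials>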

The EDOT Collector emits internal metrics under the otelcol.* namespace (refer to the Collector service metadata for more information). However, when you scrape the Prometheus endpoint, metric names are normalized to Prometheus format and appear with the otelcol_* prefix (dots become underscores). Use them to monitor the Collector’s internal state and surface operational issues.

Monitor telemetry flow across pipeline stages:

  • otelcol_receiver_accepted_spans
  • otelcol_receiver_refused_spans
  • otelcol_receiver_failed_spans
  • otelcol_exporter_sent_spans
  • otelcol_exporter_send_failed_spans

Look for gaps between accepted and sent spans to identify delays or failures.
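
If you also evaluate these metrics with a Prometheus-compatible rule engine, a recording rule along the following lines can surface that gap. The rule name and the 5-minute window are arbitrary choices, and depending on your Collector version the exposed counter names might carry a _total suffix:

groups:
  - name: otelcol-pipeline
    rules:
      # A sustained positive value means spans are arriving faster than they
      # are exported, so they are queuing, being retried, or being dropped.
      - record: otelcol:span_throughput_gap:rate5m
        expr: >
          sum(rate(otelcol_receiver_accepted_spans[5m]))
          - sum(rate(otelcol_exporter_sent_spans[5m]))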

Monitor queue pressure between processors and exporters:

  • otelcol_exporter_queue_size
  • otelcol_exporter_queue_capacity
  • otelcol_exporter_enqueue_failed_spans

Rising queue sizes or enqueue failures might signal backpressure or telemetry loss.

Track send failures and retry behavior:

  • otelcol_exporter_send_failed_spans
  • otelcol_exporter_send_failed_metric_points
  • otelcol_exporter_send_failed_log_records

High failure counts might result from network errors, invalid credentials, or backend throttling. Exporters might retry failed sends automatically, so these metrics don't always indicate data loss.

Monitor the Collector's resource utilization:

  • otelcol_process_memory_rss
  • otelcol_process_cpu_seconds
  • otelcol_runtime_num_goroutines

High or growing values can indicate memory leaks, inefficient configuration, or excessive load.

The following patterns help identify and resolve common Collector performance issues.

Backpressure and queue buildup

Symptoms:

  • Queue size increases over time
  • Enqueue failures or dropped spans

Causes:

  • Backend slowness or outages
  • Exporter throughput limits
  • Insufficient Collector resources

Resolution:

  • Check exporter health and credentials
  • Tune queue and batch settings (see the sketch below)
  • Scale the Collector instance or deployment

For more information, refer to Export failures when sending telemetry data (sending_queue overflow, exporter timeouts) and 429 errors when using the mOTLP endpoint (rate limiting and backpressure).
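
The queue and batch settings live on the exporter and in the batch processor. The following is a minimal sketch; the values are illustrative starting points rather than recommendations, and the exporter name must match the one used in your pipelines:

processors:
  batch:
    send_batch_size: 8192    # items per batch before a send is triggered
    timeout: 2s              # maximum wait before sending a partial batch

exporters:
  otlp:
    endpoint: <your-otlp-endpoint>:4317
    sending_queue:
      enabled: true
      num_consumers: 10      # parallel senders draining the queue
      queue_size: 5000       # batches held in memory before enqueue failures occur
    retry_on_failure:
      enabled: true
      max_elapsed_time: 300s # give up on a batch after 5 minutes of retries

Remember to add the batch processor to the processors list of each pipeline where you want batching to apply.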

Exporter send failures

Symptoms:

  • Elevated *_send_failed_* metrics
  • Growing retry queues

Causes:

  • Network issues or timeouts
  • Backend rate limiting
  • Misconfigured authentication

Resolution:

  • Verify backend availability and credentials
  • Review ingestion limits and retry logic
  • Investigate latency or firewall constraints

For more information, refer to Export failures when sending telemetry data (export failures, retries), 429 errors when using the mOTLP endpoint (rate limiting), and Connectivity issues with EDOT (network, authorization, firewall).

Resource saturation

Symptoms:

  • Rising memory RSS
  • Sustained high CPU usage
  • Increasing goroutine count

Causes:

  • High-volume telemetry ingestion
  • Inefficient processor configurations
  • Memory leaks in custom components

Resolution:

  • Adjust sampling or processing logic
  • Increase resource limits
  • Horizontally scale Collector instances

For more information, refer to Collector out of memory (OOM errors, memory exhaustion) and Insufficient resources in Kubernetes (resource limits, scaling).
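
One common safeguard against memory exhaustion is the memory_limiter processor, placed first in a pipeline's processors list so the Collector starts refusing data before it runs out of memory. The following is a minimal sketch, assuming the processor is included in your distribution; the limits are illustrative:

processors:
  memory_limiter:
    check_interval: 1s      # how often memory usage is checked
    limit_mib: 512          # upper bound on Collector memory use (illustrative)
    spike_limit_mib: 128    # headroom reserved for short-lived spikes

Add memory_limiter as the first entry in the processors list of each pipeline you want to protect.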

Use internal metrics to create dashboards and alerting rules. Track real-time pipeline health and detect regressions early.

Example alert scenarios (see the rule sketch after the list):

  • Exporter queue usage exceeds 80% for more than 5 minutes
  • Send failure rate exceeds a defined threshold
  • Dropped spans exceed a historical baseline
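
If your alerting path evaluates Prometheus rules, the first two scenarios could be sketched roughly as follows; on metrics ingested into Elastic Observability, the same thresholds can be expressed as Kibana alerting rules instead. Alert names, thresholds, and durations are illustrative, and counter names might carry a _total suffix depending on your Collector version:

groups:
  - name: otelcol-alerts
    rules:
      - alert: CollectorExporterQueueNearlyFull
        # Queue usage above 80% of capacity for more than 5 minutes.
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 5m
      - alert: CollectorSpansFailingToSend
        # Sustained span send failures over the last 5 minutes.
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 5m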