﻿---
title: Operate the Universal Profiling backend
description: This page outlines operating the backend when running Universal Profiling on a self-managed version of the Elastic Stack. Here you’ll find information...
url: https://www.elastic.co/elastic/docs-builder/docs/3028/solutions/observability/infra-and-hosts/operate-universal-profiling-backend
products:
  - Elastic Observability
applies_to:
  - Elastic Stack: Generally available
---

# Operate the Universal Profiling backend
This page outlines operating the backend when running Universal Profiling on a self-managed version of the Elastic Stack. Here you’ll find information on:
- [Resource sizing](#profiling-self-managed-ops-sizing-guidance)
- [Configuring your collector and symbolizer](#profiling-self-managed-ops-configuration)
- [Monitoring your collector and symbolizer](#profiling-self-managed-ops-monitoring)
- [Scaling your resources](#profiling-scaling-backend-resources)
- [Upgrading backend binaries](#profiling-self-managed-upgrade)
- [Kubernetes tips](#profiling-self-managed-kubernetes-tips)


## Resource guide

The resources needed to ingest and query Universal Profiling data vary based on the total number of CPU cores you’re profiling. The number of cores comes from the sum of all *virtual* cores as recorded in `/proc/cpuinfo`, adding up all the machines you’ll deploy the Universal Profiling Agent to.
Ingestion and query resource demand is almost directly proportional to the amount of data the Universal Profiling Agents generate. Calculate the data generated by the Universal Profiling Agents using the number of CPU samples collected, the number of executables processed, and the executables' debug metadata size. While the number of CPU samples collected is predictable, the number of executables processed and the size of their debug metadata are not.
The following table provides recommended resources for ingesting and querying Universal Profiling data based on your number of CPU cores:

| # of CPU cores | Elasticsearch total memory | Elasticsearch total storage (60 days retention) | Profiling Backend                  | Kibana memory |
|----------------|----------------------------|-------------------------------------------------|------------------------------------|---------------|
| 1–100          | 4GB–8GB                    | 250GB                                           | 1 Collector 2GB, 1 Symbolizer 2GB  | 2GB           |
| 100–1000       | 8GB–32GB                   | 250GB–2TB                                       | 1 Collector 4GB, 1 Symbolizer 4GB  | 2GB           |
| 1000–10,000    | 32GB–128GB                 | 2TB–8TB                                         | 2 Collector 4GB, 1 Symbolizer 8GB  | 4GB           |
| 10,000–50,000  | 128GB–512GB                | 8TB–16TB                                        | 3+ Collector 4GB, 1 Symbolizer 8GB | 8GB           |

<note>
  This table is derived from benchmarks performed on Universal Profiling with ingestion of up to 15,000 CPU cores. The profiled machines had a near-constant load of 75% CPU utilization. The deployment used 3 Elasticsearch nodes with 64 GB memory each, 8 vCPU, and 1.5 TB NVMe disk drives.
</note>

Resource demand is nearly proportional to the amount of data the Universal Profiling Agents generate. Therefore, you can calculate the necessary resources for use cases beyond those in the table by comparing your actual number of cores profiled with the number of cores in the table. When calculating, consider the following:
- The average load of the machines being profiled: The average load directly affects the number of CPU samples collected. For example, on a mostly idle system, not all CPUs will be scheduling tasks during the sampling intervals.
- The rate of change of the executables being profiled (for example, how often you deploy new versions of your software): The rate of change affects the amount of debug metadata stored in Elasticsearch as a result of symbolization. The more distinct executables the Universal Profiling Agent collects, the more debug data is stored in Elasticsearch. Note that two different builds of the same application still count as two different executables, because the Universal Profiling Agent treats each ELF file independently.
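
As a rough illustration of the calculation described above, the following Python sketch interpolates between the upper bounds of the table rows to estimate Elasticsearch storage for a given core count. The breakpoints come from the table. The linear interpolation within each band, the `estimate_storage_gb` name, and the load adjustment relative to the 75% benchmark utilization are assumptions made for illustration, not an official sizing formula. The sketch also ignores debug metadata growth, which depends on how often your executables change.

```python
# Upper bounds of each table row: (CPU cores, Elasticsearch storage in GB
# for 60 days of retention). 1 TB is treated as 1000 GB here.
BREAKPOINTS = [(100, 250), (1_000, 2_000), (10_000, 8_000), (50_000, 16_000)]

BENCHMARK_LOAD = 0.75  # CPU utilization of the benchmarked machines


def estimate_storage_gb(cores: int, avg_load: float = BENCHMARK_LOAD) -> float:
    """Estimate storage by interpolating between table rows (assumption)."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    if cores > BREAKPOINTS[-1][0]:
        raise ValueError("core count is beyond the range covered by the table")

    first_cores, first_gb = BREAKPOINTS[0]
    if cores <= first_cores:
        base = float(first_gb)  # the first row recommends a flat 250 GB
    else:
        for (lo_c, lo_gb), (hi_c, hi_gb) in zip(BREAKPOINTS, BREAKPOINTS[1:]):
            if cores <= hi_c:
                fraction = (cores - lo_c) / (hi_c - lo_c)
                base = lo_gb + fraction * (hi_gb - lo_gb)
                break

    # Hypothetical first-order adjustment: scale by load relative to the
    # benchmark, since fewer busy CPUs produce fewer samples.
    return base * (avg_load / BENCHMARK_LOAD)


print(estimate_storage_gb(550))         # halfway between the first two rows
print(estimate_storage_gb(550, 0.375))  # same fleet at half the benchmark load
```

Treat the result as a starting point and validate it against the actual index growth in your cluster.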

Storage considerations: the Elasticsearch disks' bandwidth and latency will affect the latency of ingesting and querying the profiling data. Allocate data to hot nodes for best performance and user experience. If storage becomes a concern, tune the data retention by customizing the Universal Profiling [index lifecycle management policy](/elastic/docs-builder/docs/3028/solutions/observability/infra-and-hosts/universal-profiling-index-life-cycle-management#profiling-ilm-custom-policy).

## Configure the collector and symbolizer

You can configure the collector and symbolizer using a YAML configuration file and CLI flags, with the CLI flags taking precedence over the YAML file. The configuration files are created during the installation process, as described in the [Create configuration files](/elastic/docs-builder/docs/3028/solutions/observability/infra-and-hosts/step-4-run-backend-applications#profiling-self-managed-running-linux-configfile) section. Comments in the configuration files explain the purpose of each configuration option.
Restart the backend binaries after modifying the configuration files for changes to take effect.

### Use CLI flags to override configuration file values

When building configuration options for each of the backend binaries, you can use CLI flags to override the values in the YAML configuration file. The overrides **must** contain the full path to the configuration option and must be in a key=value format. For example, `-E application.field.key=value`, where `application` is the name of the binary.
For example, to enable TLS in the HTTP server of the collector, you can pass the `-E pf-elastic-collector.ssl.enabled=true` flag. This will override the `ssl.enabled` option found in the YAML configuration file.
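
If you generate the command line from automation, the full-path rule is easy to get wrong. The following Python sketch builds the `-E` arguments from a dictionary of option paths relative to the binary's top-level key. The `build_override_flags` helper is hypothetical and only formats strings; it does not validate that an option exists.

```python
import shlex


def build_override_flags(binary: str, overrides: dict[str, object]) -> list[str]:
    """Return `-E <binary>.<path>=<value>` arguments for each override."""
    flags = []
    for path, value in overrides.items():
        # YAML booleans are lowercase, so render Python bools accordingly.
        rendered = str(value).lower() if isinstance(value, bool) else str(value)
        flags += ["-E", f"{binary}.{path}={rendered}"]
    return flags


args = build_override_flags("pf-elastic-collector", {"ssl.enabled": True})
# Prints: pf-elastic-collector -E pf-elastic-collector.ssl.enabled=true
print(shlex.join(["pf-elastic-collector", *args]))
```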

## Monitoring

Monitor the collector and symbolizer through [Logs](#profiling-self-managed-ops-monitoring-logs) and [Metrics](#profiling-self-managed-ops-monitoring-metrics) to ensure the services are running and healthy. Without both services running, profiling data will not be ingested and symbolized, and querying Kibana won’t return data.

### Logs

The collector and symbolizer always log to standard output. You can turn on debug logs by setting the `verbose` configuration option to `true` in the YAML configuration file.
Avoid using debug logs in production, as they can be very verbose and impact backend performance. Only enable debug logs when troubleshooting a failed deployment or when instructed to do so by support.
Logs are formatted as "key=value" pairs, and Elasticsearch and Kibana can automatically parse them into fields.
A log collector, such as Filebeat, can collect and send logs to Elasticsearch for indexing and analysis. Depending on how the backend is installed, you can use a Filebeat input of type `journald` (for OS packages), `log` (for binaries), or `container` to process the logs. Refer to the [Filebeat documentation](https://docs-v3-preview.elastic.dev/elastic/docs-builder/docs/3028/reference/beats/filebeat/configuring-howto-filebeat) for more information.
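
For quick, ad hoc inspection outside of Elasticsearch, the "key=value" format is also easy to parse yourself. The following Python sketch splits a line into fields using `shlex`, which keeps quoted values containing spaces together. The sample line and its field names are hypothetical, and lines with unbalanced quotes would need a dedicated parser.

```python
import shlex


def parse_log_line(line: str) -> dict[str, str]:
    """Split a key=value log line into a dict; quoted values may contain spaces."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            fields[key] = value
    return fields


# Hypothetical log line; the real field names may differ.
SAMPLE = 'time="2024-05-01T10:00:00Z" level=error msg="bulk indexing failed" index=profiling-events'

fields = parse_log_line(SAMPLE)
print(fields["level"], "-", fields["msg"])
```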

### Metrics

Metrics are not exposed by default. Enable metrics in the `metrics` section in the YAML configuration files. The collector and symbolizer can expose metrics in both JSON and Prometheus formats.
Metrics in JSON format can be exposed through an HTTP server or a Unix domain socket. Prometheus metrics can only be exposed through an HTTP server. Customize where the metrics are exposed using the `metrics.prometheus_host` and `metrics.expvar_host` configuration options.
You can use Metricbeat to scrape metrics. Consume the JSON directly through the `http` module. Consume the Prometheus endpoint using the `prometheus` module. When using an HTTP server for either format, the URI to scrape metrics from is `/metrics`.
For example, the following collector configuration would expose metrics in Prometheus format on port 9090 and in JSON format on port 9191. You can then scrape them by connecting to `http://127.0.0.1:9090/metrics` and `http://127.0.0.1:9191/metrics` respectively.
```yaml
pf-elastic-collector:
  metrics:
    prometheus_host: ":9090"
    expvar_host: ":9191"
```
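
If you want to inspect the Prometheus endpoint without Metricbeat, the text format is simple enough to read directly. The following Python sketch parses a scrape into a dictionary keyed by metric name (including any labels). It assumes the samples carry no trailing timestamps, and the `SAMPLE` text is made up for illustration; in practice you would pass the body returned by `http://127.0.0.1:9090/metrics`.

```python
def parse_prometheus_text(text: str) -> dict[str, float]:
    """Map each sample (metric name plus labels) to its value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        # Assumes the value is the last space-separated field (no timestamp).
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


# Illustrative scrape output, not captured from a real collector.
SAMPLE = """\
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 42
process_resident_memory_bytes 1.048576e+08
"""

metrics = parse_prometheus_text(SAMPLE)
print(metrics["go_goroutines"])
```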

Optionally, you can also expose the `expvar` format over a Unix domain socket, by setting the `expvar_socket` configuration option to a valid path. For example, the following collector configuration would expose metrics in Prometheus format on port 9090 and in JSON format over a Unix domain socket at `/tmp/collector.sock`.
```yaml
pf-elastic-collector:
  metrics:
    prometheus_host: ":9090"
    expvar_socket: "/tmp/collector.sock"
```

The following sections show the most relevant metrics exposed by the backend binaries. Include these metrics in your monitoring dashboards to detect backend issues.
**Common runtime metrics**
- `process_cpu_seconds_total`: track the amount of CPU time used by the process.
- `process_resident_memory_bytes`: track the amount of RAM used by the process.
- `go_memstats_heap_sys_bytes`: track the amount of heap memory.
- `go_memstats_stack_sys_bytes`: track the amount of stack memory.
- `go_threads`: number of OS threads created by the runtime.
- `go_goroutines`: number of active goroutines.

**Collector metrics**
- `collection_agent.indexing.bulk_indexer_failure_count`: number of times the bulk indexer failed to ingest data in Elasticsearch.
- `collection_agent.indexing.document_count.*`: counter that represents the number of documents ingested in Elasticsearch for each index; can be used to calculate the rate of ingestion for each index.
- `grpc_server_handling_seconds`: histogram of the time spent by the gRPC server to handle requests.
- `grpc_server_msg_received_total`: count of messages received by the gRPC server; can be used to calculate the rate of ingestion for each RPC.
- `grpc_server_handled_total`: count of messages processed by the gRPC server; can be used to calculate the availability of the gRPC server for each RPC.

**Symbolizer metrics**
- `symbols_app.indexing.bulk_indexer_failure_count`: number of times the bulk indexer failed to ingest data in Elasticsearch.
- `symbols_app.indexing.document_count.*`: counter that represents the number of documents ingested in Elasticsearch for each index; can be used to calculate the rate of ingestion for each index.
- `symbols_app.user_client.document_count.update.*`: counter that represents the number of existing documents that were updated in Elasticsearch for each index; when the rate increases, it can impact Elasticsearch performance.
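
The counters listed above become useful once you turn them into rates and ratios. The following Python sketch shows the two calculations mentioned in the lists: a per-second ingestion rate from two scrapes of a counter such as `collection_agent.indexing.document_count.*`, and gRPC availability from `grpc_server_handled_total`. The function names are hypothetical. The sketch also assumes that `grpc_server_handled_total` carries a `grpc_code` label with the value `OK` for successful calls, which is common for Go gRPC Prometheus instrumentation; verify the label names against your own `/metrics` output.

```python
def counter_rate(prev: float, curr: float, interval_s: float) -> float:
    """Per-second rate between two scrapes of a monotonically increasing counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    if curr < prev:
        # The counter went backwards, so the process restarted and the counter
        # reset to zero: count only what accumulated since the restart.
        return curr / interval_s
    return (curr - prev) / interval_s


def grpc_availability(handled_by_code: dict[str, float]) -> float:
    """Fraction of handled RPCs that completed with the `OK` status code."""
    total = sum(handled_by_code.values())
    if total == 0:
        return 1.0  # no traffic: treated as available (assumption)
    return handled_by_code.get("OK", 0.0) / total


# Two scrapes of a document counter taken 60 seconds apart.
print(counter_rate(prev=1000, curr=1600, interval_s=60))
# Handled RPC counts grouped by the assumed `grpc_code` label.
print(grpc_availability({"OK": 990, "Unavailable": 10}))
```

Monitoring systems such as Prometheus or Kibana perform these calculations for you; the sketch only makes the arithmetic explicit.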

**Health checks**
The backend binaries expose two health check endpoints, `/live` and `/ready`, that you can use to monitor the health of the application. The endpoints return a `200 OK` HTTP status code when the checks are successful.
The health check endpoints are hosted in the same HTTP server that accepts the incoming profiling data. The address of this server is configured through the application’s `host` configuration option.
For example, if the collector is configured with the default value `host: 0.0.0.0:8260`, you can check the health of the application by running `curl -i localhost:8260/live` and `curl -i localhost:8260/ready`.
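
For automated checks, the same probes can be scripted. The following Python sketch requests both endpoints and reports whether each returned `200`. The `check_health` name and the two-second timeout are illustrative choices, and the base URL assumes the default collector `host` value mentioned above.

```python
import urllib.error
import urllib.request


def check_health(base_url: str, timeout_s: float = 2.0) -> dict[str, bool]:
    """Return whether `/live` and `/ready` each answer with HTTP 200."""
    results = {}
    for path in ("/live", "/ready"):
        try:
            with urllib.request.urlopen(base_url + path, timeout=timeout_s) as resp:
                results[path] = resp.status == 200
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or an HTTP error status.
            results[path] = False
    return results


if __name__ == "__main__":
    # Assumes a collector listening on the default port on this machine.
    print(check_health("http://localhost:8260"))
```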

## Profiling agent telemetry data

The Universal Profiling collector receives telemetry data from profiling agents to help debug agent operations and gather product usage statistics. This data enables us to understand the demographics of profiled machines and aids in investigations when customers report issues.
By default, telemetry data collected by all profiling agents is sent to the collector, which then forwards it to an Elastic endpoint over the internet. You can opt out by configuring the `agent_metrics` stanza in the collector configuration YAML file. If you opt out, any troubleshooting by customer service will require manually extracting and providing the telemetry data.
The ["Collector configuration file"](/elastic/docs-builder/docs/3028/solutions/observability/infra-and-hosts/step-4-run-backend-applications#profiling-self-managed-running-linux-configfile-collector) allows you to configure whether telemetry data is forwarded to Elastic, stored internally, or discarded. If the collector is deployed in a network without internet access, telemetry data will not be forwarded to Elastic.
Below are examples of configurations for each option:
**Forward telemetry data to Elastic**
This is the default configuration. When the `agent_metrics` stanza is absent from the collector configuration file, the collector forwards telemetry data to Elastic.
Explicitly enabling it does not alter the default behavior.
```yaml
agent_metrics:
  disabled: false
```

**Collect telemetry data internally and send it to Elastic**
In this configuration, telemetry data from profiling agents is sent to Elastic **and** stored internally. The data is stored in the `profiling-metrics*` indices within the same Elasticsearch cluster as Universal Profiling data and follows the same data retention policies.
```yaml
agent_metrics:
  disabled: false
  write_all: true
```

**Collect telemetry data only internally**
To collect telemetry data without forwarding it to Elastic, configure the collector to store the data internally. You can specify the Elasticsearch deployment for storing telemetry data by providing a list of Elasticsearch hosts and an API key for authentication.
```yaml
agent_metrics:
  disabled: false
  addresses: ["https://internal-telemetry-endpoint.es.company.org:9200"]
  api_key: "internal-telemetry-api-key"
```

**Disable telemetry data collection entirely**
```yaml
agent_metrics:
  disabled: true
```


## Scale resources

In the [resource guidance table](#profiling-self-managed-ops-sizing-guidance), no options use more than one replica for the symbolizer. This is because multiple symbolizer replicas have to synchronize over identical frame records to be processed. While it is still possible to scale horizontally, we recommend scaling the symbolizer vertically first, by increasing the memory and CPU cores it uses to process data.
You can increase the number of collector replicas at will, keeping their vertical sizing smaller if this is more convenient for your deployment use case. The collector has a linear increase in memory usage and CPU threads with the number of Universal Profiling Agents that it serves. Keep in mind that since the Universal Profiling Agent/collector communication happens via gRPC, there may be long-lived TCP sessions that are bound to a single collector replica. When scaling out the number of replicas, depending on the load balancer that you have in place fronting the collector’s endpoint, you may want to shut down the older replicas after adding new replicas. This ensures that the load is evenly distributed across all replicas.

## Upgrade a self-hosted stack

Upgrading a self-hosted stack involves upgrading the backend applications and the agent. We recommend upgrading the backend first, followed by the agent. This way, if you encounter problems with the backend, you can roll back to the previous version without needing to downgrade the agent.
<note>
  We recommend having the same version of the agent and the backend deployed.
</note>

We strive to maintain backward compatibility between minor versions. Occasionally, changes to the data format may require having the same version of the agent and backend deployed. When a breaking change in the protocol is introduced, the profiling agents that are not up to date will stop sending data. The agent logs will report an error message indicating that the backend is not compatible with the agent (or vice versa).
The upgrade process steps vary depending on the installation method used. You may have a combination of installation methods. For example, you might deploy the backend on ECE and the agents on Kubernetes. In that case, refer to the specific sections (backend/agent) in each method.
<important>
  Depending on your infrastructure setup, upgrading the backend may also update the endpoint exposed by the collector. In this case, amend the agent configuration to connect to the new endpoint upon upgrade.
</important>


### ECE

When using ECE, the upgrade process of the backend is part of the installation of a new ECE release. You don’t need to perform any action to upgrade the backend applications, as they will be upgraded automatically.
For the agent deployment, you can upgrade the Fleet integration installed on the Elastic Agent if that’s how you’re deploying the agent.

### ECK or generic Kubernetes

Upgrade the backend charts using the `helm upgrade` command. You can reuse the existing values or provide the full values YAML file on each upgrade.
For the agent deployment, upgrading through the Helm chart is also the simplest option.
<important>
  Starting with version 8.15, the agent Helm chart has been renamed from `pf-host-agent` to `profiling-agent`.
</important>

When **upgrading to 8.15 from 8.14 or lower**, follow these additional instructions:
1. Fetch the currently applied Helm values:
   ```bash
   helm -n universal-profiling get values pf-host-agent -oyaml > profiling-agent-values.yaml
   ```
2. Update the repo to find the new chart:
   ```bash
   helm repo update
   ```
3. Uninstall the old chart:
   ```bash
   helm -n universal-profiling uninstall pf-host-agent
   ```
4. Install the new chart by following the instructions displayed in the Universal Profiling "Add Data" page or with the following command:
   ```bash
   helm install -n universal-profiling universal-profiling-agent elastic/profiling-agent -f profiling-agent-values.yaml
   ```


### OS packages

Upgrade the package version using the OS package manager. You can find the names of and links to the new packages on the "Add Data" page.
Not all package managers will call into `systemd` to restart the service, so you may need to restart the service manually or through any other automation in place.

### Binaries

Download the new binary version using the command shown in the [Binary](/elastic/docs-builder/docs/3028/solutions/observability/infra-and-hosts/step-4-run-backend-applications#profiling-self-managed-running-linux-binary) section of the setup guide, replace the existing binary with it, and restart the services.
You will find the links to the new binaries in the "Add Data" page, under the "Binary" tab.

## Kubernetes tips

When deploying the Universal Profiling backend on Kubernetes, there are some best practices to follow.

### Ingress configuration

If you are using an ingress controller, the connection routing to the collector Service should be configured to use the gRPC protocol.
We provide an `Ingress` resource as part of the Helm chart. Because the ingress can be any implementation, you must configure the controller with a class name and any necessary annotations using the `ingress.annotations` field.
For example, when using an NGINX ingress controller, set the annotation `nginx.ingress.kubernetes.io/backend-protocol: "GRPC"`, as shown in the following example:
```yaml
ingress:
  create: true
  ingressClassName: "nginx"
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
```

For the symbolizer, the connection routing should be configured to use the HTTP protocol. There is usually no need to customize annotations for this type of service, but the chart provides similar configuration options.

### Input TLS configuration

Terminating the TLS connection is not currently supported at the application level, even if the `pf-elastic-collector` and `pf-elastic-symbolizer` configurations include an `ssl` section.
Instead, use an ingress controller to terminate TLS connections and forward unencrypted traffic to the backend services.
To enable TLS termination, add a `tls` section to the `ingress` values introduced in the previous section.
Both the collector and symbolizer Helm charts support an `ingress.tls` section, which lets you specify the TLS secret name and hosts that the certificate should be used for.
We recommend using a certificate manager like [cert-manager](https://cert-manager.io/) to automate certificate provisioning and renewal for ingress resources.
Refer to the [ingress-nginx documentation](https://kubernetes.github.io/ingress-nginx/user-guide/tls/#tlshttps) for an example of how to configure TLS termination with the NGINX ingress controller.
In general, the steps are:
1. Store your TLS certificate in a Kubernetes secret in the same namespace as the collector and/or symbolizer.
   ```bash
   kubectl -n universal-profiling create secret tls my-tls-secret --cert=path/to/cert.pem --key=path/to/key.pem
   ```
2. Configure the `ingress.tls` section in the Helm values file used to run the backend applications, for example:
   ```yaml
   ingress:
     <other configs...>
     tls:
       - secretName: my-tls-secret
         hosts:
           - my-host.com
   ```
3. Deploy the charts by running `helm upgrade` with the updated values files.


### Output TLS configuration

You can secure the communication between the Universal Profiling backend and the Elasticsearch cluster by enabling TLS in the `output.elasticsearch` section of the collector and symbolizer configuration files.
To do so, Kubernetes secrets containing the TLS key pairs should be provisioned in the namespace where the backend is installed. In case of self-signed certificates, the CA bundle used to validate Elasticsearch’s certificates should also be part of the secret.
Create two secrets, one for the collector and one for the symbolizer, named `pf-collector-tls-certificate` and `pf-symbolizer-tls-certificate` respectively. The secrets should contain the following keys:
- `tls.key`: the certificate private key
- `tls.cert`: the certificate public key
- `ca.cert` (optional): the certificate CA bundle

Follow these steps to enable TLS connections from the collector and symbolizer to Elasticsearch:
1. Create secrets with the TLS key pairs (omit the `ca.pem` field if you are not using a self-signed CA):
   ```bash
   kubectl -n universal-profiling create secret generic pf-collector-tls-certificate --from-file=tls.key=/path/to/key.pem \
   --from-file=tls.cert=/path/to/cert.pem --from-file=ca.pem=/path/to/ca.crt
   ```
   ```bash
   kubectl -n universal-profiling create secret generic pf-symbolizer-tls-certificate --from-file=tls.key=/path/to/key.pem \
   --from-file=tls.cert=/path/to/cert.pem --from-file=ca.pem=/path/to/ca.crt
   ```
2. Update the collector and symbolizer Helm values files to enable the use of TLS configuration, uncommenting the `output.elasticsearch.ssl` section:
   ```yaml
   output:
     elasticsearch:
       ssl:
         enabled: true
   ```
3. Upgrade the charts using the `helm upgrade` command, providing the updated values file.


### Horizontal scaling

When scaling the Universal Profiling backend on Kubernetes, you can increase the number of replicas for the collector, or enable Horizontal Pod Autoscaling V2.
To enable HPAv2 for the collector or symbolizer, you can set the `autoscalingV2` dictionary in each Helm values file.
At the moment, **it is not recommended to enable an autoscaler for the symbolizer**. Due to a current limitation on how symbolizer replicas synchronize their workloads, it is best to use a single replica for the symbolizer. Scale the symbolizer vertically first. Consider adding more replicas only if symbolizing native frames shows high latency (10+ minutes).