﻿---
title: Large language model performance matrix for Observability
description: This page summarizes internal test results comparing large language models (LLMs) across Observability AI chat use cases. These ratings apply equally...
url: https://www.elastic.co/elastic/docs-builder/docs/3028/solutions/observability/ai/llm-performance-matrix
products:
  - Elastic Observability
applies_to:
  - Elastic Cloud Serverless: Generally available
  - Elastic Stack: Generally available since 9.2
---

# Large language model performance matrix for Observability

This page summarizes internal test results comparing large language models (LLMs) across Observability [AI chat](https://www.elastic.co/elastic/docs-builder/docs/3028/explore-analyze/ai-features/ai-chat-experiences) use cases. These ratings apply equally whether you're using [AI Assistant](https://www.elastic.co/elastic/docs-builder/docs/3028/solutions/observability/ai/observability-ai-assistant) or [Agent Builder](https://www.elastic.co/elastic/docs-builder/docs/3028/solutions/observability/ai/agent-builder-observability).
<important>
  **Rating legend:**

  - **Excellent:** Highly accurate and reliable for the use case.
  - **Great:** Strong performance with minor limitations.
  - **Good:** May be adequate for many use cases, but with noticeable tradeoffs.
  - **Poor:** Significant issues; not recommended for production for the use case.

  Recommended models are those rated **Excellent** or **Great** for the particular use case.
</important>


## Proprietary models

Models from third-party LLM providers.

| Provider       | Model                 | **Alert questions** | **APM questions** | **Contextual insights** | **Documentation retrieval** | **Elasticsearch operations** | **ES\|QL generation** | **Execute connector** | **Knowledge retrieval** |
|----------------|-----------------------|---------------------|-------------------|-------------------------|-----------------------------|------------------------------|----------------------|-----------------------|-------------------------|
| Amazon Bedrock | **Claude Sonnet 3.5** | Excellent           | Excellent         | Excellent               | Excellent                   | Excellent                    | Excellent            | Good                  | Excellent               |
| Amazon Bedrock | **Claude Sonnet 3.7** | Excellent           | Excellent         | Excellent               | Excellent                   | Excellent                    | Excellent            | Great                 | Excellent               |
| Amazon Bedrock | **Claude Sonnet 4**   | Excellent           | Excellent         | Excellent               | Excellent                   | Excellent                    | Excellent            | Great                 | Excellent               |
| Amazon Bedrock | **Claude Sonnet 4.5** | Excellent           | Excellent         | Excellent               | Excellent                   | Excellent                    | Excellent            | Good                  | Excellent               |
| Google Gemini  | **Gemini 2.0 Flash**  | Excellent           | Good              | Excellent               | Excellent                   | Excellent                    | Good                 | Good                  | Excellent               |
| Google Gemini  | **Gemini 2.5 Flash**  | Excellent           | Good              | Excellent               | Excellent                   | Excellent                    | Great                | Good                  | Excellent               |
| Google Gemini  | **Gemini 2.5 Pro**    | Excellent           | Great             | Excellent               | Excellent                   | Excellent                    | Great                | Good                  | Excellent               |
| OpenAI         | **GPT-4.1**           | Excellent           | Excellent         | Excellent               | Excellent                   | Excellent                    | Great                | Good                  | Excellent               |
| OpenAI         | **GPT-4.1 Mini**      | Excellent           | Great             | Excellent               | Excellent                   | Excellent                    | Great                | Good                  | Excellent               |
| OpenAI         | **GPT-5**             | Excellent           | Excellent         | Excellent               | Excellent                   | Excellent                    | Excellent            | Good                  | Excellent               |
| OpenAI         | **GPT-5.2**           | Excellent           | Great             | Excellent               | Excellent                   | Excellent                    | Great                | Good                  | Excellent               |


## Open-source models

<applies-to>
  - Elastic Cloud Serverless: Preview
  - Elastic Stack: Preview since 9.2
</applies-to>

Models you can [deploy and manage yourself](https://www.elastic.co/elastic/docs-builder/docs/3028/explore-analyze/ai-features/llm-guides/connect-to-lmstudio-observability).

| Provider        | Model                                   | **Alert questions** | **APM questions** | **Contextual insights** | **Documentation retrieval** | **Elasticsearch operations** | **ES\|QL generation** | **Execute connector** | **Knowledge retrieval** |
|-----------------|-----------------------------------------|---------------------|-------------------|-------------------------|-----------------------------|------------------------------|----------------------|-----------------------|-------------------------|
| DeepSeek        | **DeepSeek-V3.1**                       | Excellent           | Excellent         | Excellent               | Excellent                   | Excellent                    | Great                | Great                 | Excellent               |
| Google DeepMind | **Gemma-3-27b-it**                      | Excellent           | Good              | Great                   | Great                       | Excellent                    | Good                 | Great                 | Excellent               |
| OpenAI          | **gpt-oss-20b**                         | Poor                | Poor              | Great                   | Poor                        | Good                         | Poor                 | Good                  | Good                    |
| OpenAI          | **gpt-oss-120b**                        | Excellent           | Poor              | Great                   | Great                       | Excellent                    | Good                 | Good                  | Excellent               |
| Meta            | **Llama-3.3-70B-Instruct**              | Excellent           | Good              | Great                   | Excellent                   | Excellent                    | Good                 | Good                  | Excellent               |
| Meta            | **Llama-4-Maverick-17B-128E-Instruct**  | Great               | Good              | Great                   | Excellent                   | Excellent                    | Good                 | Good                  | Great                   |
| Mistral         | **Mistral-Small-3.2-24B-Instruct-2506** | Excellent           | Poor              | Great                   | Great                       | Excellent                    | Good                 | Good                  | Excellent               |
| Alibaba Cloud   | **Qwen2.5-72b-Instruct**                | Excellent           | Good              | Great                   | Excellent                   | Excellent                    | Good                 | Good                  | Excellent               |

<note>
  `Llama-3.3-70B-Instruct` and `Qwen2.5-72b-Instruct` were tested with simulated function calling.
</note>


## Evaluate your own model

You can run the Observability AI evaluation framework against any model to benchmark a custom or self-hosted model against the use cases in this matrix. Refer to the [evaluation framework README](https://github.com/elastic/kibana/blob/main/x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/README.md) for setup and usage details.

For consistency, all ratings in this matrix were generated using `Gemini 2.5 Pro` as the judge model (specified through the `--evaluateWith` flag). Use the same judge when evaluating your own model to ensure comparable results.