Large language model performance matrix
Serverless Security Stack
This page describes the performance of various large language models (LLMs) for different use cases in Elastic Security, based on our internal testing. To learn more about these use cases, refer to AI-Powered features.
Important
Higher scores indicate better performance. A score of 100 on a task means the model met or exceeded all task-specific benchmarks.
A score of "Not recommended" means the model failed testing for that task. Failures can have several causes, including context window constraints.
Models from third-party LLM providers.
| Model | Alerts | ES\|QL Query Generation | Knowledge Base Retrieval | Attack Discovery | General Security | Automatic Migration | Average Score |
|---|---|---|---|---|---|---|---|
| GPT 5 Chat | 91 | 92 | 100 | 85 | 92 | 99 | 93 |
| Sonnet 4.5 | 90 | 90 | 100 | 80 | 90 | 100 | 92 |
| GPT 5.1 | 93 | 95 | 100 | 95 | 65 | 98 | 91 |
| Sonnet 3.7 | 89 | 90 | 100 | 70 | 90 | 97 | 89 |
| Elastic Managed LLM | 89 | 90 | 100 | 70 | 90 | 97 | 89 |
| Opus 4.5 | 86 | 86 | 100 | 85 | 90 | 73 | 87 |
| Gemini 2.5 Pro | 89 | 86 | 100 | 87 | 90 | 63 | 86 |
| Opus 4.1 | 92 | 93 | 100 | 70 | 90 | 70 | 86 |
| Sonnet 4 | 89 | 92 | 100 | 70 | 88 | 75 | 86 |
| GPT 4.1 | 87 | 88 | 100 | 80 | 88 | 31 | 79 |
| Gemini 2.5 Flash | 87 | 90 | Not recommended | Not recommended | 90 | Not recommended | 45 |
| Haiku 4.5 | 84 | 80 | Not recommended | Not recommended | 88 | Not recommended | 42 |
Models you can deploy yourself.
| Model | Alerts | ES\|QL Query Generation | Knowledge Base Retrieval | Attack Discovery | General Security | Automatic Migration | Average Score |
|---|---|---|---|---|---|---|---|
| GPT OSS 20b | 82 | 25 | Not recommended | Not recommended | 10 | Not recommended | 20 |
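
This page does not define how the Average Score column is calculated. The published values are consistent with one reading: count each "Not recommended" entry as 0, average over all six tasks, and round halves up. The Python sketch below checks that assumption against a few rows from the tables. The function name `average_score` and the use of `None` for "Not recommended" are illustrative choices, not part of any Elastic tooling.

```python
from typing import Optional, Sequence


def average_score(scores: Sequence[Optional[int]]) -> int:
    """Average the task scores, counting None ("Not recommended") as 0.

    Assumption: halves round up. Integer arithmetic is used because
    Python's built-in round() rounds halves to the nearest even number.
    """
    total = sum(s or 0 for s in scores)
    n = len(scores)
    return (2 * total + n) // (2 * n)


# Task order: Alerts, ES|QL Query Generation, Knowledge Base Retrieval,
# Attack Discovery, General Security, Automatic Migration.
rows = {
    "GPT 5 Chat": ([91, 92, 100, 85, 92, 99], 93),
    "Gemini 2.5 Flash": ([87, 90, None, None, 90, None], 45),
    "GPT OSS 20b": ([82, 25, None, None, 10, None], 20),
}

for model, (scores, published) in rows.items():
    computed = average_score(scores)
    print(f"{model}: computed {computed}, published {published}")
```

If this reading is correct, the low averages for Gemini 2.5 Flash and Haiku 4.5 mostly reflect the three tasks they failed, not weak scores on the tasks they passed.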