# Large language model performance matrix for Elastic Security
This page summarizes internal test results comparing large language models (LLMs) across Elastic Security AI chat and AI-powered feature use cases. The ratings apply whether you use AI Assistant or Agent Builder. To learn more about these use cases, refer to AI-powered features.
> **Important**
>
> Higher scores indicate better performance. A score of 10 on a task means the model met or exceeded all task-specific benchmarks.
>
> Models rated "Not recommended" for a task failed testing on it. This could be due to various issues, including context window constraints.
## Models from third-party LLM providers
| Model | Alerts | Security Knowledge | ES\|QL Query Generation | Knowledge Base Retrieval | Attack Discovery | Automatic Migration | Average Score |
|---|---|---|---|---|---|---|---|
| Opus 4.6 | 8.9 | 9.5 | 8.5 | 8.42 | 8.7 | 10 | 9 |
| Sonnet 4.5 | 8.6 | 7.6 | 7.7 | 7.23 | 8 | 10 | 8.19 |
| Opus 4.5 | 9 | 8.2 | 7.5 | 7.94 | 8.5 | 7.3 | 8.07 |
| GPT 5.2 | 8.6 | 6.6 | 8 | 6 | 8.5 | 10 | 7.95 |
| Sonnet 4 | 7.5 | 7.4 | 8 | 7.85 | 7 | 7.5 | 7.54 |
| Sonnet 4.6 | 9.3 | 9.5 | 8.4 | 7.45 | Not recommended | 10 | 7.44 |
| Sonnet 3.7 | 7.4 | 6.9 | 6.1 | 7.04 | 7 | 9.7 | 7.36 |
| GPT 5.1 | 9.3 | 4.3 | 7.2 | 6 | 6.5 | 9.8 | 7.18 |
| GPT 4.1 Mini | 6.5 | 6.4 | 6 | 6.96 | 4.5 | 9.9 | 6.71 |
| Gemini 2.5 Flash | 7.8 | 6.2 | 4.4 | 5.71 | 6 | 9.81 | 6.65 |
| Gemini 2.5 Pro | 8 | 5.6 | 1.9 | 5.3 | 8.7 | 6.3 | 5.97 |
| GPT 4.1 | 7.4 | 5.7 | 4.4 | 5.85 | 8 | 3.1 | 5.74 |
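The docs don't state how the Average Score column is derived, but the numbers are consistent with a simple mean over all six use cases in which "Not recommended" counts as 0. A minimal sketch, under that assumption:

```python
# Sketch: reproduce the "Average Score" column, assuming (not stated
# above) that averages are simple means over all six use-case scores,
# with "Not recommended" counted as 0, rounded to two decimal places.

def average_score(scores):
    """Mean over six use-case scores; 'Not recommended' treated as 0."""
    values = [0.0 if s == "Not recommended" else s for s in scores]
    return round(sum(values) / len(values), 2)

# Two rows from the table above:
opus_4_6 = [8.9, 9.5, 8.5, 8.42, 8.7, 10]                     # average 9
sonnet_4_6 = [9.3, 9.5, 8.4, 7.45, "Not recommended", 10]     # average 7.44

print(average_score(opus_4_6))
print(average_score(sonnet_4_6))
```

This explains why Sonnet 4.6 averages lower than models with weaker per-task scores: its single "Not recommended" result pulls the mean down.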
## Models you can deploy yourself
| Model | Alerts | Security Knowledge | ES\|QL Query Generation | Knowledge Base Retrieval | Attack Discovery | Automatic Migration | Average Score |
|---|---|---|---|---|---|---|---|
| GPT OSS 120B | 7.6 | 3.7 | 5.5 | 6 | 3.5 | 9.7 | 6 |
| GPT OSS 20B | 8.2 | 1.5 | 2.5 | Not recommended | Not recommended | Not recommended | 2.03 |