Large language model performance matrix for Elastic Security
This page summarizes internal test results comparing large language models (LLMs) across Elastic Security AI chat and AI-powered feature use cases. These ratings apply equally whether you're using AI Assistant or Agent Builder. To learn more about these use cases, refer to AI-powered features.
Important
Higher scores indicate better performance. A score of 10 on a task means the model met or exceeded all task-specific benchmarks.
Models rated "Not recommended" failed testing. Failures can have several causes, including context window constraints.
The following table rates models from third-party LLM providers.
| Model | Alert Triage | Detection Engineering | Investigation | KB Retrieval | Workflow Execution | Overall |
|---|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 10 | 4.88 | 6.44 | 6.26 | 10 | 7.52 |
| Claude Opus 4.6 | 10 | 4.31 | 6.58 | 6.41 | 9.71 | 7.4 |
| Gemini 3.1 Pro | 10 | 4.69 | 6.21 | 6.02 | 9.62 | 7.31 |
| GPT-5.4 | 10 | 4.41 | 6.83 | 6.67 | 8.6 | 7.3 |
| Gemini 3.0 Flash | 8.43 | 4.09 | 5.71 | 5.49 | 9.14 | 6.57 |
The following table rates models you can deploy yourself.
| Model | Alert Triage | Detection Engineering | Investigation | KB Retrieval | Workflow Execution | Overall |
|---|---|---|---|---|---|---|
| GPT OSS 120B | 7.31 | 1.81 | 6.94 | 6.79 | 5.17 | 5.6 |
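
The page does not state how the Overall column is derived. In the rows above, each Overall value matches the unweighted mean of the five task scores, rounded to two decimal places. The sketch below checks that reading against the published numbers. The formula is an assumption inferred from the tables, not a documented method, and the names `TASK_SCORES`, `PUBLISHED_OVERALL`, and `overall_score` are hypothetical.

```python
from statistics import mean

# Task scores copied from the tables above, in column order:
# Alert Triage, Detection Engineering, Investigation, KB Retrieval, Workflow Execution.
TASK_SCORES = {
    "Claude Sonnet 4.6": [10, 4.88, 6.44, 6.26, 10],
    "Claude Opus 4.6": [10, 4.31, 6.58, 6.41, 9.71],
    "Gemini 3.1 Pro": [10, 4.69, 6.21, 6.02, 9.62],
    "GPT-5.4": [10, 4.41, 6.83, 6.67, 8.6],
    "Gemini 3.0 Flash": [8.43, 4.09, 5.71, 5.49, 9.14],
    "GPT OSS 120B": [7.31, 1.81, 6.94, 6.79, 5.17],
}

# Overall values as published in the tables above.
PUBLISHED_OVERALL = {
    "Claude Sonnet 4.6": 7.52,
    "Claude Opus 4.6": 7.4,
    "Gemini 3.1 Pro": 7.31,
    "GPT-5.4": 7.3,
    "Gemini 3.0 Flash": 6.57,
    "GPT OSS 120B": 5.6,
}


def overall_score(task_scores):
    """Assumed formula: unweighted mean of the task scores, rounded to 2 decimals."""
    return round(mean(task_scores), 2)


if __name__ == "__main__":
    for model, scores in TASK_SCORES.items():
        computed = overall_score(scores)
        published = PUBLISHED_OVERALL[model]
        print(f"{model}: computed {computed}, published {published}")
```

Because every task carries equal weight under this reading, a low Detection Engineering score lowers the Overall value as much as a low score in any other column. If one use case matters more to you, compare models on that column directly instead of relying on Overall.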