Large language model performance matrix for Elastic Security

This page summarizes internal test results comparing large language models (LLMs) across Elastic Security AI chat and AI-powered feature use cases. These ratings apply equally whether you're using AI Assistant or Agent Builder. To learn more about these use cases, refer to AI-powered features.

Important

Higher scores indicate better performance. A score of 10 on a task means the model met or exceeded all task-specific benchmarks.

Models with a score of "Not recommended" failed testing. This could be due to various issues, including context window constraints.

Models from third-party LLM providers.

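| Model             | Alert Triage | Detection Engineering | Investigation | KB Retrieval | Workflow Execution | Overall |
|-------------------|--------------|-----------------------|---------------|--------------|--------------------|---------|
| Claude Sonnet 4.6 | 10           | 4.88                  | 6.44          | 6.26         | 10                 | 7.52    |
| Claude Opus 4.6   | 10           | 4.31                  | 6.58          | 6.41         | 9.71               | 7.4     |
| Gemini 3.1 Pro    | 10           | 4.69                  | 6.21          | 6.02         | 9.62               | 7.31    |
| GPT-5.4           | 10           | 4.41                  | 6.83          | 6.67         | 8.6                | 7.3     |
| Gemini 3.0 Flash  | 8.43         | 4.09                  | 5.71          | 5.49         | 9.14               | 6.57    |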

Models you can deploy yourself.

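| Model        | Alert Triage | Detection Engineering | Investigation | KB Retrieval | Workflow Execution | Overall |
|--------------|--------------|-----------------------|---------------|--------------|--------------------|---------|
| GPT OSS 120B | 7.31         | 1.81                  | 6.94          | 6.79         | 5.17               | 5.6     |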
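
This page does not state how the Overall column is derived. For the six models listed above, the published Overall value matches the unweighted mean of the five task scores, rounded to two decimal places. The following sketch checks that observation against the numbers copied from the tables. It is an inference from the data, not a documented formula, and the names `SCORES`, `mean_score`, and `overall_matches_mean` are illustrative only.

```python
# Task scores copied from the tables on this page, in column order:
# Alert Triage, Detection Engineering, Investigation, KB Retrieval,
# Workflow Execution. The second tuple element is the published Overall score.
SCORES = {
    "Claude Sonnet 4.6": ([10, 4.88, 6.44, 6.26, 10], 7.52),
    "Claude Opus 4.6": ([10, 4.31, 6.58, 6.41, 9.71], 7.4),
    "Gemini 3.1 Pro": ([10, 4.69, 6.21, 6.02, 9.62], 7.31),
    "GPT-5.4": ([10, 4.41, 6.83, 6.67, 8.6], 7.3),
    "Gemini 3.0 Flash": ([8.43, 4.09, 5.71, 5.49, 9.14], 6.57),
    "GPT OSS 120B": ([7.31, 1.81, 6.94, 6.79, 5.17], 5.6),
}


def mean_score(task_scores):
    """Unweighted mean of the task scores, rounded to two decimals.

    Assumption: equal weighting per task. The page does not document this.
    """
    return round(sum(task_scores) / len(task_scores), 2)


def overall_matches_mean(scores=SCORES):
    """Map each model to True if its published Overall equals the rounded mean."""
    return {
        model: mean_score(tasks) == overall
        for model, (tasks, overall) in scores.items()
    }


if __name__ == "__main__":
    for model, matches in overall_matches_mean().items():
        print(f"{model}: mean={mean_score(SCORES[model][0])} matches={matches}")
```

If the equal-weight assumption holds, a model with a low score on one task (for example, Detection Engineering) can still rank well overall, so compare the per-task columns for the use case you care about rather than relying on Overall alone.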