Large language model performance matrix for Elastic Security

This page summarizes internal test results comparing large language models (LLMs) across Elastic Security AI chat and AI-powered feature use cases. These ratings apply equally whether you're using AI Assistant or Agent Builder. To learn more about these use cases, refer to AI-powered features.

Important

Higher scores indicate better performance. A score of 10 on a task means the model met or exceeded all task-specific benchmarks.

Models rated "Not recommended" failed testing. Failures can have several causes, including context window constraints.

Proprietary models

Models from third-party LLM providers.

Model	Alert Triage	Detection Engineering	Investigation	KB Retrieval	Workflow Execution	Overall
Claude Sonnet 4.6	10	4.88	6.44	6.26	10	7.52
Claude Opus 4.6	10	4.31	6.58	6.41	9.71	7.4
Gemini 3.1 Pro	10	4.69	6.21	6.02	9.62	7.31
GPT-5.4	10	4.41	6.83	6.67	8.6	7.3
Gemini 3.0 Flash	8.43	4.09	5.71	5.49	9.14	6.57

Open-source models

Models you can deploy yourself.

Model	Alert Triage	Detection Engineering	Investigation	KB Retrieval	Workflow Execution	Overall
GPT OSS 120B	7.31	1.81	6.94	6.79	5.17	5.6
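
In both tables, the Overall column appears to match the unweighted mean of the five task scores, rounded to two decimal places. This is an observation from the published numbers, not a documented scoring formula. The following minimal sketch checks that assumption. The names `SCORES`, `mean_matches_overall`, and `check_all` are hypothetical and exist only for this illustration.

```python
from statistics import fmean

# Task order used in the tables above.
TASKS = (
    "Alert Triage",
    "Detection Engineering",
    "Investigation",
    "KB Retrieval",
    "Workflow Execution",
)

# (task scores in TASKS order, published Overall), copied from the tables.
SCORES = {
    "Claude Sonnet 4.6": ((10, 4.88, 6.44, 6.26, 10), 7.52),
    "Claude Opus 4.6": ((10, 4.31, 6.58, 6.41, 9.71), 7.4),
    "Gemini 3.1 Pro": ((10, 4.69, 6.21, 6.02, 9.62), 7.31),
    "GPT-5.4": ((10, 4.41, 6.83, 6.67, 8.6), 7.3),
    "Gemini 3.0 Flash": ((8.43, 4.09, 5.71, 5.49, 9.14), 6.57),
    "GPT OSS 120B": ((7.31, 1.81, 6.94, 6.79, 5.17), 5.6),
}


def mean_matches_overall(task_scores, overall, tolerance=0.005):
    """Return True if the unweighted mean agrees with Overall to two decimals.

    Assumption: Overall is a plain average. The source page does not state this.
    """
    # Small epsilon guards against floating-point noise at the rounding edge.
    return abs(fmean(task_scores) - overall) <= tolerance + 1e-9


def check_all(scores=SCORES):
    """Map each model name to whether its row is consistent with the assumption."""
    return {
        model: mean_matches_overall(task_scores, overall)
        for model, (task_scores, overall) in scores.items()
    }


if __name__ == "__main__":
    for model, consistent in check_all().items():
        status = "consistent" if consistent else "inconsistent"
        print(f"{model}: {status} with an unweighted mean")
```

If the averaging assumption holds, the Overall score weights every use case equally, so a model's ranking for a single task, such as Detection Engineering, can differ from its Overall ranking.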