# Large language model performance matrix for Elastic Security
This page summarizes internal test results comparing large language models (LLMs) across Elastic Security AI chat and AI-powered feature use cases. The ratings apply whether you use AI Assistant or Agent Builder. To learn more about these use cases, refer to AI-powered features.
> **Important**
>
> Higher scores indicate better performance. A score of 10 on a task means the model met or exceeded all task-specific benchmarks.
>
> Models rated "Not recommended" for a task failed testing on it. This could be due to various issues, including context window constraints.
## Models from third-party LLM providers
| Model | Alerts | Security Knowledge | ES\|QL Query Generation | Knowledge Base Retrieval | Attack Discovery | Automatic Migration | Average Score |
|---|---|---|---|---|---|---|---|
| Opus 4.6 | 8.9 | 9.5 | 8.5 | 8.42 | 8.7 | 10 | 9 |
| Sonnet 4.5 | 8.6 | 7.6 | 7.7 | 7.23 | 8 | 10 | 8.19 |
| Opus 4.5 | 9 | 8.2 | 7.5 | 7.94 | 8.5 | 7.3 | 8.07 |
| GPT 5.2 | 8.6 | 6.6 | 8 | 6 | 8.5 | 10 | 7.95 |
| Sonnet 4 | 7.5 | 7.4 | 8 | 7.85 | 7 | 7.5 | 7.54 |
| Sonnet 4.6 | 9.3 | 9.5 | 8.4 | 7.45 | Not recommended | 10 | 7.44 |
| Sonnet 3.7 | 7.4 | 6.9 | 6.1 | 7.04 | 7 | 9.7 | 7.36 |
| GPT 5.1 | 9.3 | 4.3 | 7.2 | 6 | 6.5 | 9.8 | 7.18 |
| GPT 4.1 Mini | 6.5 | 6.4 | 6 | 6.96 | 4.5 | 9.9 | 6.71 |
| Gemini 2.5 Flash | 7.8 | 6.2 | 4.4 | 5.71 | 6 | 9.81 | 6.65 |
| Gemini 2.5 Pro | 8 | 5.6 | 1.9 | 5.3 | 8.7 | 6.3 | 5.97 |
| GPT 4.1 | 7.4 | 5.7 | 4.4 | 5.85 | 8 | 3.1 | 5.74 |
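The docs don't state how the Average Score column is derived, but the numbers are consistent with a simple mean over all six use cases in which "Not recommended" counts as 0. A minimal sketch, under that assumption:

```python
# Sketch: reproduce the "Average Score" column, assuming (not stated
# above) that averages are simple means over all six use-case scores,
# with "Not recommended" counted as 0, rounded to two decimal places.

def average_score(scores):
    """Mean over six use-case scores; 'Not recommended' treated as 0."""
    values = [0.0 if s == "Not recommended" else s for s in scores]
    return round(sum(values) / len(values), 2)

# Two rows from the table above:
opus_4_6 = [8.9, 9.5, 8.5, 8.42, 8.7, 10]                     # average 9
sonnet_4_6 = [9.3, 9.5, 8.4, 7.45, "Not recommended", 10]     # average 7.44

print(average_score(opus_4_6))
print(average_score(sonnet_4_6))
```

This explains why Sonnet 4.6 averages lower than models with weaker per-task scores: its single "Not recommended" result pulls the mean down.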
## Models you can deploy yourself
| Model | Alerts | Security Knowledge | ES\|QL Query Generation | Knowledge Base Retrieval | Attack Discovery | Automatic Migration | Average Score |
|---|---|---|---|---|---|---|---|
| GPT OSS 120B | 7.6 | 3.7 | 5.5 | 6 | 3.5 | 9.7 | 6 |
| GPT OSS 20B | 8.2 | 1.5 | 2.5 | Not recommended | Not recommended | Not recommended | 2.03 |