---
title: Inference integrations
description: Elasticsearch provides a machine learning inference API to create and manage inference endpoints that integrate with services such as Elasticsearch (for...
url: https://www.elastic.co/elastic/docs-builder/docs/3016/explore-analyze/elastic-inference/inference-api
products:
  - Kibana
applies_to:
  - Elastic Cloud Serverless: Generally available
  - Elastic Stack: Generally available
---

# Inference integrations
Elasticsearch provides a machine learning [inference API](https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-inference) to create and manage inference endpoints that integrate with services such as Elasticsearch (for built-in NLP models like [ELSER](https://www.elastic.co/elastic/docs-builder/docs/3016/explore-analyze/machine-learning/nlp/ml-nlp-elser) and [E5](https://www.elastic.co/elastic/docs-builder/docs/3016/explore-analyze/machine-learning/nlp/ml-nlp-e5)), as well as popular third-party services like Amazon Bedrock, Anthropic, Azure AI Studio, Cohere, Google AI, Mistral, OpenAI, Hugging Face, and more.
You can use the default inference endpoints your deployment contains or create a new inference endpoint:
- using the [Create an inference endpoint API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put)
- through the [Inference endpoints UI](#add-new-inference-endpoint).
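As a sketch of the API route, the request body for a third-party endpoint has the same shape as the examples later on this page. The following illustrates the `openai` service; the API key placeholder and model choice are examples, not requirements:

```json
{
  "service": "openai",
  "service_settings": {
    "api_key": "<your_api_key>",
    "model_id": "text-embedding-3-small"
  }
}
```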


## Default inference endpoints

Your Elasticsearch deployment contains preconfigured inference endpoints, which makes it easier to use them when defining `semantic_text` fields or inference processors. These endpoints come in two forms:
- **Elastic Inference Service (EIS) endpoints**, which provide inference as a managed service and do not consume resources from your own nodes.
- **ML node-based endpoints**, which run on your dedicated machine learning nodes.

The following sections list the default inference endpoints, identified by their `inference_id`, grouped by whether they are EIS-based or ML node-based.

### Default endpoints for Elastic Inference Service (EIS)

- `.elser-2-elastic`: uses the [ELSER](https://www.elastic.co/elastic/docs-builder/docs/3016/explore-analyze/machine-learning/nlp/ml-nlp-elser) trained model through the Elastic Inference Service for `sparse_embedding` tasks (recommended for English language text). The `model_id` is `.elser_model_2`. <applies-to>Elastic Stack: Preview since 9.1</applies-to> <applies-to>Self-managed Elastic deployments: Unavailable</applies-to> <applies-to>Elastic Cloud Serverless: Preview</applies-to>


### Default endpoints used on ML nodes

- `.elser-2-elasticsearch`: uses the [ELSER](https://www.elastic.co/elastic/docs-builder/docs/3016/explore-analyze/machine-learning/nlp/ml-nlp-elser) built-in trained model for `sparse_embedding` tasks (recommended for English language text). The `model_id` is `.elser_model_2_linux-x86_64`.
- `.multilingual-e5-small-elasticsearch`: uses the [E5](https://www.elastic.co/elastic/docs-builder/docs/3016/explore-analyze/machine-learning/nlp/ml-nlp-e5) built-in trained model for `text_embedding` tasks (recommended for non-English language texts). The `model_id` is `.e5_model_2_linux-x86_64`.

Use the `inference_id` of the endpoint in a [`semantic_text`](https://docs-v3-preview.elastic.dev/elastic/docs-builder/docs/3016/reference/elasticsearch/mapping-reference/semantic-text) field definition or when creating an [inference processor](https://docs-v3-preview.elastic.dev/elastic/docs-builder/docs/3016/reference/enrich-processor/inference-processor). The API call will automatically download and deploy the model, which might take a couple of minutes. Default inference endpoints have adaptive allocations enabled. For these models, the minimum number of allocations is `0`. If there is no inference activity that uses the endpoint, the number of allocations automatically scales down to `0` after 15 minutes.
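For example, to use the default ELSER endpoint in an index mapping, reference its `inference_id` in a `semantic_text` field definition (the field name `content` here is illustrative):

```json
{
  "mappings": {
    "properties": {
      "content": {
        "type": "semantic_text",
        "inference_id": ".elser-2-elasticsearch"
      }
    }
  }
}
```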

## Inference endpoints UI

The **Inference endpoints** page provides an interface for managing inference endpoints.
![Inference endpoints UI](https://www.elastic.co/elastic/docs-builder/docs/3016/explore-analyze/images/kibana-inference-endpoints-ui.png)

Available actions:
- Add new endpoint
- View endpoint details
- Copy the inference endpoint ID
- Delete endpoints


## Add new inference endpoint

To add a new inference endpoint using the UI:
1. Select the **Add endpoint** button.
2. Select a service from the drop-down menu.
3. Provide the required configuration details.
4. Select **Save** to create the endpoint.

If your inference endpoint uses a model deployed in Elastic’s infrastructure, such as ELSER, E5, or a model uploaded through Eland, you can configure [adaptive allocations](#adaptive-allocations) to dynamically adjust resource usage based on the current demand.

### Creating custom EIS endpoints

Your deployment includes [default inference endpoints](#default-inference-endpoints), which are preconfigured and ready to use. In most cases, you should use these default endpoints.
However, you may choose to manually create a **custom Elastic Inference Service (EIS)** endpoint if you need to instantiate a specific model version or configuration that is not covered by the defaults.
To create a custom EIS endpoint:
1. In the **Service** dropdown, select **Elastic Inference Service**.
2. In the **Settings** section, enter the specific **Model ID**. For a complete list of valid Model IDs and their corresponding task types, refer to the [Elastic Inference Service supported models](/elastic/docs-builder/docs/3016/explore-analyze/elastic-inference/eis#supported-models).
3. (Optional) Under **More options**, set the **Maximum Input Tokens**. This limits the number of tokens processed per request. If left blank, the model's default limit is used.
4. Expand **Additional settings** and select the **Task type** that corresponds to your model.
5. Select **Save**.


## Adaptive allocations

Adaptive allocations allow inference services to dynamically adjust the number of model allocations based on the current load.
This feature is only supported for models deployed in Elastic’s infrastructure, such as ELSER, E5, or models uploaded through Eland. It is not available for models used through the Elastic Inference Service (EIS) or third-party services (for example, Alibaba Cloud, Cohere, or OpenAI), because those models are not deployed within your Elasticsearch cluster.
When adaptive allocations are enabled:
- The number of allocations scales up automatically when the load increases.
- Allocations scale down to a minimum of 0 when the load decreases, saving resources.


### Allocation scaling behavior

The behavior of allocations depends on several factors:
- Deployment type (Elastic Cloud Hosted, Elastic Cloud Enterprise, or Serverless)
- Usage level (low, medium, or high)
- Optimization type ([ingest](/elastic/docs-builder/docs/3016/deploy-manage/autoscaling/trained-model-autoscaling#ingest-optimized) or [search](/elastic/docs-builder/docs/3016/deploy-manage/autoscaling/trained-model-autoscaling#search-optimized))

<important>
  If you enable adaptive allocations and set the `min_number_of_allocations` to a value greater than `0`, you will be charged for the machine learning resources, even if no inference requests are sent. However, setting the `min_number_of_allocations` to a value greater than `0` keeps the model always available without scaling delays. Choose the configuration that best fits your workload and availability needs.
</important>

For more information about adaptive allocations and resources, refer to the [trained model autoscaling](https://www.elastic.co/elastic/docs-builder/docs/3016/deploy-manage/autoscaling/trained-model-autoscaling) documentation.
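As a sketch, adaptive allocations are configured in the `service_settings` when creating an ML node-based endpoint; the allocation bounds below are illustrative values, not recommendations:

```json
{
  "service": "elasticsearch",
  "service_settings": {
    "model_id": ".elser_model_2",
    "num_threads": 1,
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 0,
      "max_number_of_allocations": 4
    }
  }
}
```

With `min_number_of_allocations` set to `0`, the deployment scales down completely when idle; raising it keeps the model always available at the cost of reserved resources.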

## Configuring chunking

Inference endpoints have a limit on the amount of text they can process at once, determined by the model's input capacity. Chunking is the process of splitting the input text into pieces that remain within these limits.
It occurs when ingesting documents into [`semantic_text` fields](https://docs-v3-preview.elastic.dev/elastic/docs-builder/docs/3016/reference/elasticsearch/mapping-reference/semantic-text). Chunking also helps produce sections that are digestible for humans. Returning a long document in search results is less useful than providing the most relevant chunk of text.
Each chunk will include the text subpassage and the corresponding embedding generated from it.
By default, documents are split into sentences and grouped in sections up to 250 words with 1 sentence overlap so that each chunk shares a sentence with the previous chunk. Overlapping ensures continuity and prevents vital contextual information in the input text from being lost by a hard break.
Elasticsearch uses the [ICU4J](https://unicode-org.github.io/icu-docs/) library to detect word and sentence boundaries for chunking. [Word boundaries](https://unicode-org.github.io/icu/userguide/boundaryanalysis/#word-boundary) are identified by following a series of rules, which include detecting the presence of a whitespace character. For written languages that do not use whitespace, such as Chinese or Japanese, dictionary lookups are used to detect word boundaries.
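The default behavior described above corresponds to the following explicit `chunking_settings`; you don't need to specify these values to get the defaults, they are shown here only for illustration:

```json
{
  "chunking_settings": {
    "strategy": "sentence",
    "max_chunk_size": 250,
    "sentence_overlap": 1
  }
}
```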

### Chunking strategies

Several strategies are available for chunking:

#### `sentence`

The `sentence` strategy splits the input text at sentence boundaries. Each chunk contains one or more complete sentences, preserving sentence-level context, except when a sentence causes a chunk to exceed the `max_chunk_size` word count, in which case it is split across chunks. The `sentence_overlap` option defines the number of sentences from the previous chunk to include in the current chunk; valid values are `0` and `1`.
The following example creates an inference endpoint with the `elasticsearch` service that deploys the ELSER model and configures the chunking behavior with the `sentence` strategy.
```json
{
  "service": "elasticsearch",
  "service_settings": {
    "model_id": ".elser_model_2",
    "num_allocations": 1,
    "num_threads": 1
  },
  "chunking_settings": {
    "strategy": "sentence",
    "max_chunk_size": 100,
    "sentence_overlap": 0
  }
}
```

The default chunking strategy is `sentence`.

#### `word`

The `word` strategy splits the input text on individual words up to the `max_chunk_size` limit. The `overlap` option is the number of words from the previous chunk to include in the current chunk.
The following example creates an inference endpoint with the `elasticsearch` service that deploys the ELSER model and configures the chunking behavior with the `word` strategy, setting a maximum of 120 words per chunk and an overlap of 40 words between chunks.
```json
{
  "service": "elasticsearch",
  "service_settings": {
    "model_id": ".elser_model_2",
    "num_allocations": 1,
    "num_threads": 1
  },
  "chunking_settings": {
    "strategy": "word",
    "max_chunk_size": 120,
    "overlap": 40
  }
}
```


#### `recursive`

<applies-to>
  - Elastic Stack: Generally available since 9.1
</applies-to>

The `recursive` strategy splits the input text based on a configurable list of separator patterns, such as paragraph boundaries or Markdown structural elements like headings and horizontal rules. The chunker applies these separators in order, recursively splitting any chunk that exceeds the `max_chunk_size` word limit. If no separator produces a small enough chunk, the strategy falls back to [sentence-level splitting](#sentence).
You can configure the `recursive` strategy using either:
- [Predefined separator groups](#predefined-separator-groups): [`plaintext`](#plaintext) or [`markdown`](#markdown)
- [Custom separators](#custom-separators): Define your own regular expression patterns


##### Predefined separator groups

Predefined separator groups provide optimized patterns for common text formats: [`plaintext`](#plaintext) works for simple line-structured text without markup, and [`markdown`](#markdown) works for Markdown-formatted content.

###### `plaintext`

The `plaintext` separator group splits text at paragraph boundaries, first attempting to split on double newlines (paragraph breaks), then falling back to single newlines when chunks are still too large.
<dropdown title="Regular expression patterns for the `plaintext` separator group">
  1. `(?<!\\n)\\n\\n(?!\\n)`: Splits on consecutive newlines that indicate paragraph breaks.
  2. `(?<!\\n)\\n(?!\\n)`: Splits on single newlines when double newlines don't produce small enough chunks.
</dropdown>

The following example configures chunking with the `recursive` strategy using the `plaintext` separator group and a maximum of 200 words per chunk.
```json
{
  "service": "elasticsearch",
  "service_settings": {
    "model_id": ".elser_model_2",
    "num_allocations": 1,
    "num_threads": 1
  },
  "chunking_settings": {
    "strategy": "recursive",
    "max_chunk_size": 200,
    "separator_group": "plaintext"
  }
}
```


###### `markdown`

The `markdown` separator group splits text based on Markdown structural elements, processing separators hierarchically from highest to lowest level: H1 through H6 headings, then horizontal rules.
<dropdown title="Regular expression patterns for the `markdown` separator group">
  1. `\n# `: Splits on level 1 headings (H1).
  2. `\n## `: Splits on level 2 headings (H2).
  3. `\n### `: Splits on level 3 headings (H3).
  4. `\n#### `: Splits on level 4 headings (H4).
  5. `\n##### `: Splits on level 5 headings (H5).
  6. `\n###### `: Splits on level 6 headings (H6).
  7. `\n^(?!\s*$).*\n-{1,}\n`: Splits on setext-style headings underlined with hyphens.
  8. `\n^(?!\s*$).*\n={1,}\n`: Splits on setext-style headings underlined with equals signs.
</dropdown>

The following example configures chunking with the `recursive` strategy using the `markdown` separator group and a maximum of 200 words per chunk.
```json
{
  "service": "elasticsearch",
  "service_settings": {
    "model_id": ".elser_model_2",
    "num_allocations": 1,
    "num_threads": 1
  },
  "chunking_settings": {
    "strategy": "recursive",
    "max_chunk_size": 200,
    "separator_group": "markdown"
  }
}
```


##### Custom separators

If the [predefined separator groups](#predefined-separator-groups) don't meet your needs, you can define custom separators using regular expressions. The following example configures chunking with the `recursive` strategy using a custom list of separators to split text into chunks of up to 180 words.
```json
{
  "service": "elasticsearch",
  "service_settings": {
    "model_id": ".elser_model_2",
    "num_allocations": 1,
    "num_threads": 1
  },
  "chunking_settings": {
    "strategy": "recursive",
    "max_chunk_size": 180,
    "separators": [
      "^(#{1,6})\\s",
      "\\n\\n",
      "\\n[-*]\\s",
      "\\n\\d+\\.\\s",
      "\\n"
    ]
  }
}
```


#### `none`

<applies-to>
  - Elastic Stack: Generally available since 9.1
</applies-to>

The `none` strategy disables chunking and processes the entire input text as a single block, without any splitting or overlap. When using this strategy, you can instead [pre-chunk](https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text#auto-text-chunking) the input by providing an array of strings, where each element acts as a separate chunk to be sent directly to the inference service without further chunking.
The following example creates an inference endpoint with the `elasticsearch` service that deploys the ELSER model and disables chunking by setting the strategy to `none`.
```json
{
  "service": "elasticsearch",
  "service_settings": {
    "model_id": ".elser_model_2",
    "num_allocations": 1,
    "num_threads": 1
  },
  "chunking_settings": {
    "strategy": "none"
  }
}
```
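For example, assuming an index with a `semantic_text` field named `content` that uses this endpoint, a document can supply pre-chunked input as an array of strings, with each element sent to the inference service as its own chunk (the field name and text are illustrative):

```json
{
  "content": [
    "This is the first pre-chunked passage.",
    "This is the second pre-chunked passage."
  ]
}
```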