Semantic reranking
This overview focuses on the high-level concepts and use cases for semantic re-ranking. For full implementation details on how to set up and use semantic re-ranking in Elasticsearch, see the reference documentation in the Search API docs.
Re-rankers improve the relevance of results from earlier-stage retrieval mechanisms. Semantic re-rankers use machine learning models to reorder search results based on their semantic similarity to a query.
Semantic re-ranking requires relatively large and complex machine learning models and operates in real time in response to queries. This means the technique is best applied to a small top-k result set, as one of the final steps in a retrieval pipeline. It is a powerful way to improve search relevance and works equally well with keyword, semantic, or hybrid first-stage retrieval.
The next sections provide more details on the benefits, use cases, and model types used for semantic re-ranking. The final sections include a practical, high-level overview of how to implement semantic re-ranking in Elasticsearch and links to the full reference documentation.
Semantic re-ranking enables a variety of use cases:
Lexical (BM25) retrieval results re-ranking
- Out-of-the-box semantic search by adding a simple API call to any lexical/BM25 retrieval pipeline.
- Adds semantic search capabilities on top of existing indices without reindexing, perfect for quick improvements.
- Ideal for environments with complex existing indices.
Semantic retrieval results re-ranking
- Improves results from semantic retrievers that use ELSER sparse vector embeddings or dense vector embeddings by applying a more powerful model.
- Adds a refinement layer on top of hybrid retrieval with reciprocal rank fusion (RRF) or linear combination using the linear retriever.
General applications
- Provides explicit control over document relevance in retrieval-augmented generation (RAG) use cases or other scenarios involving large language model (LLM) inputs.
Now that we’ve outlined the value of semantic re-ranking, we’ll explore the specific models that power this process and how they differ.
At a high level, two model types are used for semantic re-ranking: cross-encoders and bi-encoders.
In this version, Elasticsearch only supports cross-encoders for semantic re-ranking.
A cross-encoder model can be thought of as a more powerful, all-in-one solution, because it generates query-aware document representations. It takes the query and document texts as a single, concatenated input.
A bi-encoder model takes either document or query text as input. Document and query embeddings are computed separately, so they aren't aware of each other.
- To compute a ranking score, an external operation is required. This typically involves computing the dot product or cosine similarity between the query and document embeddings.
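For example, with query embedding $\mathbf{e}_q$ and document embedding $\mathbf{e}_d$, the ranking score is typically computed as the cosine similarity:

$$
\text{score}(q, d) = \cos(\mathbf{e}_q, \mathbf{e}_d) = \frac{\mathbf{e}_q \cdot \mathbf{e}_d}{\lVert \mathbf{e}_q \rVert \, \lVert \mathbf{e}_d \rVert}
$$

or, if the embeddings are normalized, simply the dot product $\mathbf{e}_q \cdot \mathbf{e}_d$.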
In brief, cross-encoders provide high accuracy but are more resource-intensive. Bi-encoders are faster and more cost-effective but less precise.
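To make this concrete, a cross-encoder rerank endpoint scores the query together with each candidate document in a single inference call. The following is a minimal sketch that assumes a rerank endpoint named my-rerank-endpoint has already been created (endpoint setup is covered later on this page):

POST _inference/rerank/my-rerank-endpoint
{
  "query": "How often does the moon hide the sun?",
  "input": [
    "Solar eclipses happen when the moon passes between the Earth and the sun.",
    "The moon is Earth's only natural satellite."
  ]
}

The response contains a relevance score for each query-document pair, computed jointly rather than from separately precomputed embeddings.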
In future versions, Elasticsearch will also support bi-encoders. If you're interested in a more detailed analysis of the practical differences between cross-encoders and bi-encoders, expand the next section.
Comparisons between cross-encoder and bi-encoder
The following is a non-exhaustive list of considerations when choosing between cross-encoders and bi-encoders for semantic re-ranking:
- Because a cross-encoder model simultaneously processes both query and document texts, it can better infer their relevance, making it more effective as a reranker than a bi-encoder.
- Cross-encoder models are generally larger and more computationally intensive, resulting in higher latencies and increased computational costs.
- There are significantly fewer open-source cross-encoders, while bi-encoders offer a wide variety of sizes, languages, and other trade-offs.
- Cross-encoders can also improve the relevance of results from semantic retrievers. For example, their ability to take word order into account can improve on dense or sparse embedding retrieval.
- When trained in tandem with specific retrievers (like lexical/BM25), cross-encoders can “correct” typical errors made by those retrievers.
- Cross-encoders output scores that are consistent across queries. This enables you to maintain high relevance in result sets by setting a minimum score threshold for all queries. This is important, for example, when using results in a RAG workflow or otherwise feeding results to LLMs. Note that similarity scores from bi-encoder embeddings are query-dependent, meaning you cannot set universal cut-offs.
- Bi-encoders rerank using embeddings. You can improve your re-ranking latency by creating embeddings at ingest time. These embeddings can be stored for re-ranking without being indexed for retrieval, reducing your memory footprint, as sketched after this list.
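For example, bi-encoder embeddings computed at ingest time can be stored in a dense_vector field with indexing disabled: the vectors stay available for rescoring, but no vector search index is built for them. A minimal mapping sketch, where the field names and dimension count are illustrative:

PUT my-index
{
  "mappings": {
    "properties": {
      "text": { "type": "text" },
      "text_embedding": {
        "type": "dense_vector",
        "dims": 384,
        "index": false
      }
    }
  }
}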
Elasticsearch provides two ways to add semantic re-ranking to your search pipeline:
- Using the text_similarity_reranker retriever
- Using the ES|QL RERANK command
Both use the same underlying inference endpoints and re-ranking models.
Both approaches require an inference endpoint configured for the rerank task. You have the following options:
- Use the Elastic Rerank cross-encoder model through the preconfigured .rerank-v1-elasticsearch endpoint, or create a custom one using the inference API's Elasticsearch service.
- Use the Jina AI Rerank inference endpoint to create a rerank endpoint.
- Use the Cohere Rerank inference endpoint to create a rerank endpoint.
- Use the Google Vertex AI inference endpoint to create a rerank endpoint.
- Upload a model to Elasticsearch from Hugging Face with Eland. You'll need to use the text_similarity NLP task type when loading the model with Eland, then set up an Elasticsearch service inference endpoint with the rerank endpoint type.

Refer to the Elastic NLP model reference for a list of third-party text similarity models supported by Elasticsearch for semantic re-ranking.
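As an example of the first option, the following request sketches creating a rerank endpoint backed by the Elastic Rerank model through the inference API's Elasticsearch service. The endpoint name my-elastic-rerank is arbitrary, and the allocation settings are illustrative starting points:

PUT _inference/rerank/my-elastic-rerank
{
  "service": "elasticsearch",
  "service_settings": {
    "model_id": ".rerank-v1",
    "num_threads": 1,
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 1,
      "max_number_of_allocations": 4
    }
  }
}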
You can use either retrievers or ES|QL to implement semantic re-ranking in your search pipelines.
Use the retriever syntax to compose multi-stage retrieval pipelines declaratively within a single _search call. This is a good fit when you want to combine re-ranking with other retriever stages like RRF, linear combination, or pinning.
Create a rerank endpoint using the Elasticsearch Inference API, then define a text_similarity_reranker retriever in your search request.
Example: Retriever-based semantic reranking
POST _search
{
"retriever": {
"text_similarity_reranker": {
"retriever": {
"standard": {
"query": {
"match": {
"text": "How often does the moon hide the sun?"
}
}
}
},
"field": "text",
"inference_id": "elastic-rerank",
"inference_text": "How often does the moon hide the sun?",
"rank_window_size": 100,
"min_score": 0.5
}
}
}
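Because retrievers nest, the reranker also composes with hybrid retrieval. The following sketch layers text_similarity_reranker on top of reciprocal rank fusion over a lexical match and a semantic query; the semantic_text field name is hypothetical:

POST _search
{
  "retriever": {
    "text_similarity_reranker": {
      "retriever": {
        "rrf": {
          "retrievers": [
            {
              "standard": {
                "query": {
                  "match": {
                    "text": "How often does the moon hide the sun?"
                  }
                }
              }
            },
            {
              "standard": {
                "query": {
                  "semantic": {
                    "field": "semantic_text",
                    "query": "How often does the moon hide the sun?"
                  }
                }
              }
            }
          ]
        }
      },
      "field": "text",
      "inference_id": "elastic-rerank",
      "inference_text": "How often does the moon hide the sun?",
      "rank_window_size": 100
    }
  }
}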
For full reference documentation, refer to the text_similarity_reranker retriever.
Use the ES|QL RERANK command to add re-ranking as a step in a piped query. This is a good fit when you want to combine re-ranking with other ES|QL capabilities like transformations, aggregations, or text generation with COMPLETION.
Example: ES|QL-based semantic reranking
FROM books METADATA _score
| WHERE title : "search query"
| SORT _score DESC
| LIMIT 100
| RERANK "search query" ON title
For full reference documentation, refer to the RERANK command.
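To target a specific rerank endpoint instead of relying on a default, RERANK accepts an options map. A sketch, assuming an inference endpoint named my-rerank-endpoint and that the WITH clause accepts an inference_id option (check the RERANK reference for the exact syntax):

FROM books METADATA _score
| WHERE title : "search query"
| SORT _score DESC
| LIMIT 100
| RERANK "search query" ON title WITH { "inference_id" : "my-rerank-endpoint" }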
| | Retrievers | ES\|QL |
|---|---|---|
| Syntax | Declarative JSON (retriever tree) | Piped query language |
| Composability | Nest with other retrievers (RRF, linear, pinned, diversify) | Pipe with other commands (FORK, FUSE, COMPLETION, STATS) |
| Best for | Multi-stage retrieval pipelines in the _search API | End-to-end search workflows that include transformations or generation |
| Client support | All Elasticsearch clients | All Elasticsearch clients |
A limitation of many cross-encoder models is that they do not perform well when used on corpora with long documents.
This is because many models truncate input to the length of their token window, which can potentially cut off the most relevant part of the document before it is sent to the reranker. The preconfigured .rerank-v1-elasticsearch endpoint truncates in this manner.
The chunk_rescorer in the text_similarity_reranker retriever allows explicit control over how much content is sent to the reranker. This addresses the long document problem and also allows control of inference costs by sending fewer tokens into the reranker.
Reranking on scored chunks is an expert feature that can negatively impact relevance if used with models that don’t perform truncation.
Example search request with semantic reranker using chunk rescoring
The following example shows the semantic reranker using the chunk_rescorer with its default settings to control how much content is sent to the reranker.
POST _search
{
"retriever": {
"text_similarity_reranker": {
"retriever": {
"standard": {
"query": {
"match": {
"text": "How often does the moon hide the sun?"
}
}
}
},
"field": "text",
"inference_id": "elastic-rerank",
"inference_text": "How often does the moon hide the sun?",
"chunk_rescorer": {},
"rank_window_size": 100,
"min_score": 0.5
}
}
}
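To trim inference costs further, the rescorer can be tuned rather than left at its defaults. As a hedged sketch, assuming chunk_rescorer accepts a size option that caps the number of top-scoring chunks sent to the reranker per document (check the text_similarity_reranker reference for the exact parameters), you could replace the empty object in the request above with:

"chunk_rescorer": {
  "size": 3
}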
- Read the text_similarity_reranker retriever reference for syntax and implementation details
- Read the ES|QL RERANK command reference for the piped query approach
- Learn more about the retrievers abstraction
- Learn more about choosing a query interface for your search use case
- Learn more about the Elastic Inference APIs
- Check out our Python notebook for using Cohere with Elasticsearch