---
title: Fingerprint token filter
description: Sorts and removes duplicate tokens from a token stream, then concatenates the stream into a single output token. For example, this filter changes the...
url: https://www.elastic.co/elastic/docs-builder/docs/3016/reference/text-analysis/analysis-fingerprint-tokenfilter
products:
  - Elasticsearch
---

# Fingerprint token filter
Sorts and removes duplicate tokens from a token stream, then concatenates the stream into a single output token.
For example, this filter changes the `[ the, fox, was, very, very, quick ]` token stream as follows:
1. Sorts the tokens alphabetically to `[ fox, quick, the, very, very, was ]`
2. Removes a duplicate instance of the `very` token.
3. Concatenates the token stream into a single output token: `[ fox quick the very was ]`
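You can reproduce these steps with the analyze API (the same request shape is shown in full in the example below); tokenizing the sentence `the fox was very very quick` with the `whitespace` tokenizer yields the token stream above:
```json
GET _analyze
{
  "tokenizer" : "whitespace",
  "filter" : ["fingerprint"],
  "text" : "the fox was very very quick"
}
```
The response contains the single token `fox quick the very was`.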

Output tokens produced by this filter are useful for fingerprinting and clustering a body of text as described in the [OpenRefine project](https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint).
This filter uses Lucene’s [FingerprintFilter](https://lucene.apache.org/core/10_0_0/analysis/common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html).

## Example

The following [analyze API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-analyze) request uses the `fingerprint` filter to create a single output token for the text `zebra jumps over resting resting dog`:
```json
GET _analyze
{
  "tokenizer" : "whitespace",
  "filter" : ["fingerprint"],
  "text" : "zebra jumps over resting resting dog"
}
```

The filter produces the following token:
```text
[ dog jumps over resting zebra ]
```


## Add to an analyzer

The following [create index API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-create) request uses the `fingerprint` filter to configure a new [custom analyzer](https://docs-v3-preview.elastic.dev/elastic/docs-builder/docs/3016/manage-data/data-store/text-analysis/create-custom-analyzer).
```json
PUT fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_fingerprint": {
          "tokenizer": "whitespace",
          "filter": [ "fingerprint" ]
        }
      }
    }
  }
}
```
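To verify the analyzer, you can run an analyze API request against the new index; this assumes the index was created as `fingerprint_example`, as in the request above:
```json
GET fingerprint_example/_analyze
{
  "analyzer": "whitespace_fingerprint",
  "text": "zebra jumps over resting resting dog"
}
```
As in the earlier example, this returns the single token `dog jumps over resting zebra`.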


## Configurable parameters


`max_output_size`
:   (Optional, integer) Maximum character length, including whitespace, of the output token. Defaults to `255`. Concatenated tokens longer than this will result in no token output.

`separator`
:   (Optional, string) Character to use to concatenate the token stream input. Defaults to a space.
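To see `max_output_size` in action, note that the analyze API also accepts an inline filter definition. In the following sketch, the concatenated token (`dog jumps over resting zebra`, 28 characters) exceeds the `10` character limit, so the filter outputs no token:
```json
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "fingerprint",
      "max_output_size": 10
    }
  ],
  "text": "zebra jumps over resting resting dog"
}
```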


## Customize

To customize the `fingerprint` filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.
For example, the following request creates a custom `fingerprint` filter that uses `+` to concatenate token streams. The filter also limits output tokens to `100` characters or fewer.
```json
PUT custom_fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_": {
          "tokenizer": "whitespace",
          "filter": [ "fingerprint_plus_concat" ]
        }
      },
      "filter": {
        "fingerprint_plus_concat": {
          "type": "fingerprint",
          "max_output_size": 100,
          "separator": "+"
        }
      }
    }
  }
}
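To confirm the custom separator, you can analyze the sample text with the new analyzer (assuming the index above was created as `custom_fingerprint_example`):
```json
GET custom_fingerprint_example/_analyze
{
  "analyzer": "whitespace_fingerprint",
  "text": "zebra jumps over resting resting dog"
}
```
The filter now joins the sorted, deduplicated tokens with `+`, producing `dog+jumps+over+resting+zebra`.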
```