---
title: Pattern tokenizer
description: The pattern tokenizer uses a regular expression to either split text into terms whenever it matches a word separator, or to capture matching text as terms...
url: https://www.elastic.co/elastic/docs-builder/docs/3016/reference/text-analysis/analysis-pattern-tokenizer
products:
  - Elasticsearch
---

# Pattern tokenizer
The `pattern` tokenizer uses a regular expression to either split text into terms whenever it matches a word separator, or to capture matching text as terms.
The default pattern is `\W+`, which splits text whenever it encounters non-word characters.
<admonition title="Beware of Pathological Regular Expressions">
The pattern tokenizer uses [Java Regular Expressions](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.md). A badly written regular expression could run very slowly or even throw a StackOverflowError and cause the node it is running on to exit suddenly. Read more about [pathological regular expressions and how to avoid them](https://www.regular-expressions.info/catastrophic.html).
</admonition>


## Example output

```json
{
  "tokenizer": "pattern",
  "text": "The foo_bar_size's default is 5."
}
```

The above sentence would produce the following terms:
```text
[ The, foo_bar_size, s, default, is, 5 ]
```
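The default split behavior can be approximated outside Elasticsearch. As a rough sketch (Python's `re` syntax is close enough to Java's `Pattern` for this expression), splitting the example sentence on `\W+` yields the same terms:

```python
import re

text = "The foo_bar_size's default is 5."

# Split wherever one or more non-word characters occur, mirroring the
# default `\W+` pattern. Note that `_` counts as a word character, so
# `foo_bar_size` survives intact, while the apostrophe splits off "s".
# Drop the empty strings produced at the string edges.
terms = [t for t in re.split(r"\W+", text) if t]
print(terms)  # ['The', 'foo_bar_size', 's', 'default', 'is', '5']
```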


## Configuration

The `pattern` tokenizer accepts the following parameters:
<definitions>
  <definition term="pattern">
    A [Java regular expression](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.md), defaults to `\W+`.
  </definition>
  <definition term="flags">
    Java regular expression [flags](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.md#field.summary). Flags should be pipe-separated, e.g. `"CASE_INSENSITIVE|COMMENTS"`.
  </definition>
  <definition term="group">
    Which capture group to extract as tokens. Defaults to `-1` (split).
  </definition>
</definitions>
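The `group` parameter toggles between the tokenizer's two modes: with the default `-1` the pattern marks the separators, while a non-negative value extracts that capture group of each match as a token. A minimal sketch of the distinction, using Python's `re` (whose behavior matches Java's for these simple patterns):

```python
import re

text = "comma,separated,values"

# group = -1: the pattern matches the separators, and the text
# between matches becomes the tokens (split mode).
split_tokens = [t for t in re.split(r",", text) if t]

# group = 1: the pattern matches the tokens themselves, and capture
# group 1 of each match is kept (capture mode).
capture_tokens = re.findall(r"(\w+)", text)

print(split_tokens)    # ['comma', 'separated', 'values']
print(capture_tokens)  # ['comma', 'separated', 'values']
```

Both modes produce the same terms for this input; they diverge when the interesting text is the match itself rather than what lies between matches, as in the quoted-values example below.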


## Example configuration

In this example, we configure the `pattern` tokenizer to break text into tokens when it encounters commas:
```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}
```

```json
{
  "analyzer": "my_analyzer",
  "text": "comma,separated,values"
}
```

The above example produces the following terms:
```text
[ comma, separated, values ]
```

In the next example, we configure the `pattern` tokenizer to capture values enclosed in double quotes (ignoring embedded escaped quotes `\"`). The regex itself looks like this:
```
"((?:\\"|[^"]|\\")*)"
```

And reads as follows:
- A literal `"`
- Start capturing:
  - A literal `\"` OR any character except `"`
- Repeat until no more characters match
- A literal closing `"`

When the pattern is specified in JSON, the `"` and `\` characters need to be escaped, so the pattern ends up looking like:
```
\"((?:\\\\\"|[^\"]|\\\\\")+)\"
```
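One rough way to sanity-check the escaping (assuming Python's `json` and `re` modules, whose regex syntax is close enough to Java's here) is to JSON-decode the escaped form and confirm that it reproduces the original pattern and captures the expected values:

```python
import json
import re

# The pattern exactly as it appears inside the JSON document,
# wrapped in JSON string quotes.
json_escaped = r'"\"((?:\\\\\"|[^\"]|\\\\\")+)\""'

# JSON decoding peels off one layer of escaping, restoring the
# raw regular expression shown above.
pattern = json.loads(json_escaped)
print(pattern)  # "((?:\\"|[^"]|\\")+)"

# findall returns capture group 1 of each match, mirroring "group": 1.
text = '"value", "value with embedded \\" quote"'
print(re.findall(pattern, text))
```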

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "\"((?:\\\\\"|[^\"]|\\\\\")+)\"",
          "group": 1
        }
      }
    }
  }
}
```

```json
{
  "analyzer": "my_analyzer",
  "text": "\"value\", \"value with embedded \\\" quote\""
}
```

The above example produces the following two terms:
```text
[ value, value with embedded \" quote ]
```