---
title: ICU tokenizer
description: Tokenizes text into words on word boundaries, as defined in UAX #29: Unicode Text Segmentation. It behaves much like the standard tokenizer, but adds...
url: https://www.elastic.co/elastic/docs-builder/docs/3016/reference/elasticsearch/plugins/analysis-icu-tokenizer
products:
  - Elasticsearch
---

# ICU tokenizer
Tokenizes text into words on word boundaries, as defined in [UAX #29: Unicode Text Segmentation](https://www.unicode.org/reports/tr29/). It behaves much like the [`standard` tokenizer](https://www.elastic.co/elastic/docs-builder/docs/3016/reference/text-analysis/analysis-standard-tokenizer), but adds better support for some Asian languages by using a dictionary-based approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and by using custom rules to break Myanmar and Khmer text into syllables.

For example, the following index settings define a custom analyzer that uses the `icu_tokenizer`:
```json
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}
```
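
Because the tokenizer is dictionary-based for these languages, it can segment text that contains no spaces between words. As a quick, illustrative check (the sample text is arbitrary, and the analysis-icu plugin must be installed), you can call the `_analyze` API with the tokenizer directly:
```json
{
  "tokenizer": "icu_tokenizer",
  "text": "東京都に住んでいます"
}
```
The tokenizer finds the word boundaries from its dictionaries rather than relying on whitespace.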


## Rules customization

<warning>
  This functionality is marked as experimental in Lucene.
</warning>

You can customize the `icu_tokenizer` behavior by specifying per-script rule files; see the [RBBI rules syntax reference](http://userguide.icu-project.org/boundaryanalysis#TOC-RBBI-Rules) for a more detailed explanation.

To add ICU tokenizer rules, set the `rule_files` setting, which should contain a comma-separated list of `code:rulefile` pairs in the following format: a [four-letter ISO 15924 script code](https://unicode.org/iso15924/iso15924-codes.html), followed by a colon, then a rule file name. Rule files are placed in the `ES_HOME/config` directory.
As a demonstration of how rule files can be used, save the following rule file to `$ES_HOME/config/KeywordTokenizer.rbbi`:
```text
.+ {200};
```

This single rule matches any sequence of characters and emits it as one token (the `{200}` is an RBBI rule status value), giving keyword-style tokenization for the script it is assigned to.

Then create an analyzer that uses this rule file for Latin-script text:
```json
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "icu_user_file": {
            "type": "icu_tokenizer",
            "rule_files": "Latn:KeywordTokenizer.rbbi"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "icu_user_file"
          }
        }
      }
    }
  }
}
```

To try it out, pass the following request body to the `_analyze` API of an index created with these settings:
```json
{
  "analyzer": "my_analyzer",
  "text": "Elasticsearch. Wow!"
}
```

The above `_analyze` request returns the following:
```json
{
  "tokens": [
    {
      "token": "Elasticsearch. Wow!",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
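
For comparison, with the default rules the `icu_tokenizer` would normally split the same text into the two tokens `Elasticsearch` and `Wow`, discarding the punctuation; the custom rule file instead keeps the whole string as a single token.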
```