nori_part_of_speech token filter
The nori_part_of_speech
token filter removes tokens that match a set of part-of-speech tags. The list of supported tags and their meanings can be found here: Part of speech tags
It accepts the following setting:
stoptags
- An array of part-of-speech tags that should be removed.
and defaults to:
"stoptags": [
"E",
"IC",
"J",
"MAG", "MAJ", "MM",
"SP", "SSC", "SSO", "SC", "SE",
"XPN", "XSA", "XSN", "XSV",
"UNA", "NA", "VSV"
]
For example:
PUT nori_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "nori_tokenizer",
"filter": [
"my_posfilter"
]
}
},
"filter": {
"my_posfilter": {
"type": "nori_part_of_speech",
"stoptags": [
"NR" 1
]
}
}
}
}
}
}
GET nori_sample/_analyze
{
"analyzer": "my_analyzer",
"text": "여섯 용이" 2
}
- Korean numerals should be removed (
NR
) - Six dragons
Which responds with:
{
"tokens" : [ {
"token" : "용",
"start_offset" : 3,
"end_offset" : 4,
"type" : "word",
"position" : 1
}, {
"token" : "이",
"start_offset" : 4,
"end_offset" : 5,
"type" : "word",
"position" : 2
} ]
}