Custom Analyzer

When the built-in analyzers do not fulfill your needs, you can create a custom analyzer which uses the appropriate combination of:

zero or more character filters
a tokenizer
zero or more token filters.
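
The order in which these parts run can be sketched in plain Python. This only illustrates the data flow and is not how Elasticsearch implements analysis: the `analyze` function, the type aliases, and the toy components passed to it are hypothetical names invented for this sketch.

```python
from typing import Callable, List

CharFilter = Callable[[str], str]               # rewrites the raw text
Tokenizer = Callable[[str], List[str]]          # splits text into tokens
TokenFilter = Callable[[List[str]], List[str]]  # rewrites the token stream


def analyze(text: str,
            char_filters: List[CharFilter],
            tokenizer: Tokenizer,
            token_filters: List[TokenFilter]) -> List[str]:
    """Run character filters, then the tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


# Toy components, chosen only to make the flow visible.
terms = analyze(
    "Hello & World",
    char_filters=[lambda s: s.replace("&", "and")],
    tokenizer=str.split,
    token_filters=[lambda toks: [t.lower() for t in toks]],
)
print(terms)  # ['hello', 'and', 'world']
```

Both filter lists may be empty, but exactly one tokenizer is always present, which matches the "zero or more" wording above.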

Configuration

The custom analyzer accepts the following parameters:

`tokenizer`	A built-in or customised tokenizer. (Required)
`char_filter`	An optional array of built-in or customised character filters.
`filter`	An optional array of built-in or customised token filters.
`position_increment_gap`	When indexing an array of text values, Elasticsearch inserts a fake "gap" between the last term of one value and the first term of the next value to ensure that a phrase query doesn’t match two terms from different array elements. Defaults to `100`. See `position_increment_gap` for more.
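
The effect of the gap can be illustrated with a small Python sketch. It is a simplification that assumes whitespace tokenization and leaves `gap` unused positions between consecutive array values; `index_positions` and `phrase_matches` are hypothetical helpers written for this illustration, not Elasticsearch APIs.

```python
from typing import Dict, List


def index_positions(values: List[str], gap: int = 100) -> Dict[str, List[int]]:
    """Assign a position to every term, leaving `gap` unused positions between values."""
    positions: Dict[str, List[int]] = {}
    next_pos = 0
    for value in values:
        for term in value.lower().split():
            positions.setdefault(term, []).append(next_pos)
            next_pos += 1
        next_pos += gap  # the fake gap between array elements
    return positions


def phrase_matches(positions: Dict[str, List[int]], phrase: str) -> bool:
    """True if the phrase terms occur at consecutive positions (no slop)."""
    terms = phrase.lower().split()
    if not terms:
        return False
    return any(
        all(start + i in positions.get(term, []) for i, term in enumerate(terms))
        for start in positions.get(terms[0], [])
    )


names = ["John Abraham", "Lincoln Smith"]
with_gap = index_positions(names)            # "lincoln" lands at position 102
without_gap = index_positions(names, gap=0)  # "lincoln" lands at position 2

print(phrase_matches(with_gap, "abraham lincoln"))     # False
print(phrase_matches(without_gap, "abraham lincoln"))  # True
```

With the gap in place, "abraham" and "lincoln" are far apart in position space, so a phrase spanning the two array elements does not match, while phrases inside a single element still do.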

Example configuration

Here is an example that combines the following:

Character Filter

HTML Strip Character Filter

Tokenizer

Standard Tokenizer

Token Filters

Lowercase Token Filter
ASCII-Folding Token Filter

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type":      "custom", 
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}

Setting `type` to `custom` tells Elasticsearch that we are defining a custom analyzer. Compare this to how built-in analyzers can be configured: `type` will be set to the name of the built-in analyzer, like `standard` or `simple`.

The above example produces the following terms:

[ is, this, deja, vu ]
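
As a rough cross-check of these terms, the same stages can be approximated in Python. This sketch is a stand-in under stated assumptions, not the Elasticsearch implementation: a regular expression plays the role of `html_strip`, `\w+` matching stands in for the `standard` tokenizer, and Unicode decomposition stands in for `asciifolding`. All function names are hypothetical.

```python
import re
import unicodedata
from typing import List


def strip_html(text: str) -> str:
    """Crude stand-in for html_strip: drop anything that looks like a tag."""
    return re.sub(r"<[^>]+>", "", text)


def tokenize(text: str) -> List[str]:
    """Crude stand-in for the standard tokenizer: runs of word characters."""
    return re.findall(r"\w+", unicodedata.normalize("NFC", text))


def fold_to_ascii(token: str) -> str:
    """Crude stand-in for asciifolding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def my_custom_analyzer(text: str) -> List[str]:
    tokens = tokenize(strip_html(text))                # char filter, then tokenizer
    return [fold_to_ascii(t.lower()) for t in tokens]  # lowercase, then asciifolding


print(my_custom_analyzer("Is this <b>déjà vu</b>?"))
# ['is', 'this', 'deja', 'vu']
```

Note that `is` and `this` survive because this analyzer has no stop token filter.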

The previous example used a tokenizer, token filters, and character filters with their default configurations, but it is possible to create configured versions of each and to use them in a custom analyzer.

Here is a more complicated example that combines the following:

Character Filter

Mapping Character Filter, configured to replace `:)` with `_happy_` and `:(` with `_sad_`

Tokenizer

Pattern Tokenizer, configured to split on spaces and punctuation characters (the pattern `[ .,!?]`)

Token Filters

Lowercase Token Filter
Stop Token Filter, configured to use the pre-defined list of English stop words

The following request defines these components and the analyzer that uses them:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [
            "emoticons" 
          ],
          "tokenizer": "punctuation", 
          "filter": [
            "lowercase",
            "english_stop" 
          ]
        }
      },
      "tokenizer": {
        "punctuation": { 
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": { 
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": { 
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text":     "I'm a :) person, and you?"
}

The `emoticons` character filter, `punctuation` tokenizer and `english_stop` token filter are custom implementations which are defined in the same index settings.

The above example produces the following terms:

[ i'm, _happy_, person, you ]
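
The way these terms come about can be traced with a short Python approximation of the three stages. It is a sketch, not Elasticsearch code: the emoticon mappings and the split pattern are copied from the settings above, while the contents of `ENGLISH_STOP` are an assumption about what the `_english_` list contains and should be checked against the documentation for your version.

```python
import re
from typing import List

# Mirrors the "emoticons" mapping character filter from the settings above.
EMOTICONS = {":)": "_happy_", ":(": "_sad_"}

# Assumed contents of the _english_ stop word list (not taken from this page).
ENGLISH_STOP = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
    "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
    "their", "then", "there", "these", "they", "this", "to", "was", "will", "with",
}


def my_custom_analyzer(text: str) -> List[str]:
    # 1. Character filter: replace emoticons in the raw text.
    for emoticon, replacement in EMOTICONS.items():
        text = text.replace(emoticon, replacement)
    # 2. Tokenizer: split on the pattern and discard empty tokens.
    tokens = [t for t in re.split(r"[ .,!?]", text) if t]
    # 3. Token filters: lowercase, then remove stop words.
    return [t for t in (tok.lower() for tok in tokens) if t not in ENGLISH_STOP]


print(my_custom_analyzer("I'm a :) person, and you?"))
# ["i'm", '_happy_', 'person', 'you']
```

The character filter has to run first: if the text were split on punctuation before the mapping was applied, `:)` would no longer reach the filter as a single unit.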
