This functionality is marked as experimental in Lucene.

Named word_delimiter_graph, this token filter splits words into subwords and performs optional transformations on subword groups. Words are split into subwords with the following rules:

- split on intra-word delimiters (by default, all non-alphanumeric characters): "Wi-Fi" ⇒ "Wi", "Fi"
- split on case transitions: "PowerShot" ⇒ "Power", "Shot"
- split on letter-number transitions: "SD500" ⇒ "SD", "500"
- leading and trailing intra-word delimiters on each subword are ignored: "//hello---there, dude" ⇒ "hello", "there", "dude"
- trailing "'s" are removed for each subword: "O'Neil's" ⇒ "O", "Neil"
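These defaults can be checked directly with the _analyze API; a minimal sketch using the built-in whitespace tokenizer, the filter's default settings, and an illustrative input string:

    GET _analyze
    {
      "tokenizer": "whitespace",
      "filter": [ "word_delimiter_graph" ],
      "text": "Neil's Wi-Fi PowerShot SD500"
    }

This should yield the tokens Neil, Wi, Fi, Power, Shot, SD, 500.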
Unlike the word_delimiter filter, this token filter correctly handles positions for multi-term expansion at search time when any of the following options are set to true:

- preserve_original
- catenate_numbers
- catenate_words
- catenate_all
Parameters include:
generate_word_parts
    If true, causes parts of words to be generated: "PowerShot" ⇒ "Power" "Shot". Defaults to true.
generate_number_parts
    If true, causes number subwords to be generated: "500-42" ⇒ "500" "42". Defaults to true.
catenate_words
    If true, causes maximum runs of word parts to be catenated: "wi-fi" ⇒ "wifi". Defaults to false.
catenate_numbers
    If true, causes maximum runs of number parts to be catenated: "500-42" ⇒ "50042". Defaults to false.
catenate_all
    If true, causes all subword parts to be catenated: "wi-fi-4000" ⇒ "wifi4000". Defaults to false.
split_on_case_change
    If true, causes "PowerShot" to be two tokens ("Power-Shot" remains two parts regardless). Defaults to true.
preserve_original
    If true, includes original words in subwords: "500-42" ⇒ "500-42" "500" "42". Defaults to false.
split_on_numerics
    If true, causes "j2se" to be three tokens: "j" "2" "se". Defaults to true.
stem_english_possessive
    If true, causes trailing "'s" to be removed for each subword: "O’Neil’s" ⇒ "O", "Neil". Defaults to true.
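These parameters are combined in a custom analyzer. A minimal sketch, where the index name my-index and the names my_analyzer and my_word_delimiter are illustrative:

    PUT /my-index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "type": "custom",
              "tokenizer": "whitespace",
              "filter": [ "my_word_delimiter" ]
            }
          },
          "filter": {
            "my_word_delimiter": {
              "type": "word_delimiter_graph",
              "catenate_words": true,
              "preserve_original": true
            }
          }
        }
      }
    }

Because catenate_words and preserve_original are set, this variant also gets the correct position handling for multi-term expansion described above.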
Advanced settings include:
protected_words
    A list of words protected from being delimited. Either an array, or protected_words_path can be set instead, which resolves to a file containing the protected words (one on each line). The path automatically resolves to a config/ based location if it exists.
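A sketch of the inline array form (protected_words_path would instead point at a file with one word per line; the names and word list are illustrative):

    PUT /my-index
    {
      "settings": {
        "analysis": {
          "filter": {
            "my_word_delimiter": {
              "type": "word_delimiter_graph",
              "protected_words": [ "wi-fi", "j2se" ]
            }
          }
        }
      }
    }

With this setting, "wi-fi" and "j2se" pass through the filter unsplit.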
adjust_offsets
    By default, the filter tries to output tokens with adjusted offsets that reflect their actual position in the token stream. However, when combined with other filters that alter the length or starting position of tokens without a corresponding change to their offsets (such as trim), this can cause tokens with illegal offsets to be emitted. Setting adjust_offsets to false will stop word_delimiter_graph from adjusting these internal offsets.
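A sketch of disabling offset adjustment when combining with the trim filter (index and filter names are illustrative):

    PUT /my-index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "type": "custom",
              "tokenizer": "keyword",
              "filter": [ "trim", "my_word_delimiter" ]
            }
          },
          "filter": {
            "my_word_delimiter": {
              "type": "word_delimiter_graph",
              "adjust_offsets": false
            }
          }
        }
      }
    }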
type_table
    A custom type mapping table, for example (when configured using type_table_path):
    # Map the $, %, '.', and ',' characters to DIGIT
    # This might be useful for financial data.
    $ => DIGIT
    % => DIGIT
    . => DIGIT
    \u002C => DIGIT

    # in some cases you might not want to split on ZWJ
    # this also tests the case where we need a bigger byte[]
    # see http://en.wikipedia.org/wiki/Zero-width_joiner
    \u200D => ALPHANUM
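Mappings of this kind can also be supplied inline via type_table rather than a file; a sketch (note that backslashes must be escaped inside JSON strings):

    PUT /my-index
    {
      "settings": {
        "analysis": {
          "filter": {
            "my_word_delimiter": {
              "type": "word_delimiter_graph",
              "type_table": [ "$ => DIGIT", "% => DIGIT", "\\u200D => ALPHANUM" ]
            }
          }
        }
      }
    }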
Using a tokenizer like the standard tokenizer may interfere with the catenate_* and preserve_original parameters, as the original string may already have lost punctuation during tokenization. Instead, you may want to use the whitespace tokenizer.
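To see why, compare the two tokenizers on the same input; a sketch using an inline filter definition:

    GET _analyze
    {
      "tokenizer": "whitespace",
      "filter": [ { "type": "word_delimiter_graph", "catenate_words": true } ],
      "text": "wi-fi"
    }

With the whitespace tokenizer, "wi-fi" reaches the filter as a single token and can be catenated to "wifi". With the standard tokenizer, the input is already split into "wi" and "fi" before the filter runs, so "wifi" can never be produced.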