Most fields are indexed by default, which makes them searchable. Sorting, aggregations, and accessing field values in scripts, however, require a different access pattern from search.
Search needs to answer the question "Which documents contain this term?", while sorting and aggregations need to answer a different question: "What is the value of this field for this document?".
Most fields can use index-time, on-disk doc_values for this data access pattern, but text fields do not support doc_values.
Instead, text fields use a query-time in-memory data structure called fielddata. This data structure is built on demand the first time that a field is used for aggregations, sorting, or in a script. It is built by reading the entire inverted index for each segment from disk, inverting the term ↔ document relationship, and storing the result in memory, in the JVM heap.
Fielddata can consume a lot of heap space, especially when loading high-cardinality text fields. Once fielddata has been loaded into the heap, it remains there for the lifetime of the segment. Loading fielddata is also an expensive process, which can cause users to experience latency hits. This is why fielddata is disabled by default.
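To see how much heap fielddata currently occupies, the cat fielddata API can be queried. The request below is a minimal sketch; my_field is a placeholder field name, not one taken from this page.

```console
GET _cat/fielddata?v&fields=my_field
```

The response lists, per node, the heap size used by fielddata for the requested field.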
If you try to sort, aggregate, or access values from a script on a text field, you will see this exception:
Fielddata is disabled on text fields by default. Set fielddata=true on [your_field_name] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory.
Before you enable fielddata, consider why you are using a text field for aggregations, sorting, or in a script. It usually doesn’t make sense to do so.
A text field is analyzed before indexing so that a value like New York can be found by searching for new or for york. A terms aggregation on this field will return a new bucket and a york bucket, when you probably want a single bucket called New York.
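The effect of analysis can be checked with the analyze API. This sketch assumes the field uses the standard analyzer, which is the default for text fields; a field mapped with a custom analyzer may produce different tokens.

```console
POST _analyze
{
  "analyzer": "standard",
  "text": "New York"
}
```

The response should contain two tokens, new and york, which are the terms that a terms aggregation on the text field would turn into buckets.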
Instead, you should have a text field for full text searches, and an unanalyzed keyword field with doc_values enabled for aggregations, as follows:
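A minimal sketch of such a mapping, using a multi-field. The index name my_index, the field name my_field, and the sub-field name keyword are placeholders chosen for illustration.

```console
PUT my_index
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
```

With this mapping, use my_field for full text searches and my_field.keyword for aggregations, sorting, or in scripts. Keyword fields have doc_values enabled by default, so no extra setting is needed.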
You can enable fielddata on an existing text field using the PUT mapping API as follows:
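A minimal sketch of that request, again assuming a placeholder index my_index with an existing text field named my_field:

```console
PUT my_index/_mapping
{
  "properties": {
    "my_field": {
      "type": "text",
      "fielddata": true
    }
  }
}
```

The mapping you send for my_field should repeat the field's existing mapping and add the fielddata parameter to it.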
Fielddata filtering can be used to reduce the number of terms loaded into memory, and thus reduce memory usage. Terms can be filtered by frequency:
The frequency filter allows you to load only terms whose document frequency falls between a min and max value, which can be expressed as an absolute number (when the number is bigger than 1.0) or as a percentage (for example, 0.01 is 1% and 1.0 is 100%). Frequency is calculated per segment. Percentages are based on the number of docs which have a value for the field, as opposed to all docs in the segment. For example, with a min of 0.001 and a max of 0.1, in a segment where 10,000 docs have a value for the field, only terms that appear in roughly 10 to 1,000 of those docs are loaded.
Small segments can be excluded completely by specifying the minimum number of docs that the segment should contain with min_segment_size:
PUT my_index
{
  "mappings": {
    "properties": {
      "tag": {
        "type": "text",
        "fielddata": true,
        "fielddata_frequency_filter": {
          "min": 0.001,
          "max": 0.1,
          "min_segment_size": 500
        }
      }
    }
  }
}