14.3. Statistics

When you issue a Cypher query, it gets compiled to an execution plan (see Chapter 16, Execution Plans) that can run and answer your question. To produce an efficient plan for your query, Neo4j needs information about your database, such as the schema — what indexes and constraints do exist? Neo4j will also use statistical information it keeps about your database to optimize the execution plan. With this information, Neo4j can decide which access pattern leads to the best performing plans.

The statistical information that Neo4j keeps is:

  1. The number of nodes with a certain label.
  2. Selectivity per index.
  3. The number of relationships by type.
  4. The number of relationships by type, ending or starting from a node with a specific label.

Neo4j keeps the statistics up to date in two different ways. For label counts for example, the number is updated whenever you set or remove a label from a node. For indexes, Neo4j needs to scan the full index to produce the selectivity number. Since this is potentially a very time-consuming operation, these numbers are collected in the background when enough data on the index has been changed.

Configuration options

Execution plans are cached and will not be replanned until the statistical information used to produce the plan has changed. The following configuration options allows you to control how sensitive replanning should be to updates of the database.

dbms.index_sampling.background_enabled

Controls whether indexes will automatically be re-sampled when they have been updated enough. The Cypher query planner depends on accurate statistics to create efficient plans, so it is important it is kept up to date as the database evolves.

[Tip]Tip

If background sampling is turned off, make sure to trigger manual sampling when data has been updated.

dbms.index_sampling.update_percentage
Controls how large portion of the index has to have been updated before a new sampling run is triggered.
cypher.statistics_divergence_threshold
Controls how much the above statistical information is allowed to change before an execution plan is considered stale and has to be replanned. If the relative change in any of statistics is larger than this threshold, the plan will be thrown away and a new one will be created. A threshold of 0.0 means always replan, and a value of 1.0 means never replan.

Managing statistics from the shell

Usage:

schema sample -a
will sample all indexes.
schema sample -l Person -p name
will sample the index for label Person on property name (if existing).
schema sample -a -f
will force a sample of all indexes.
schema sample -f -l :Person -p name
will force sampling of a specific index.