We’ll start with a basic example to help you get the hang of profiling queries. The following examples will use a movies data set.
Let’s start by importing the data:
LOAD CSV WITH HEADERS FROM "http://neo4j.com/docs/3.1.0-SNAPSHOT/csv/query-tuning/movies.csv" AS line MERGE (m:Movie { title:line.title }) ON CREATE SET m.released = toInt(line.released), m.tagline = line.tagline
LOAD CSV WITH HEADERS FROM 'http://neo4j.com/docs/3.1.0-SNAPSHOT/csv/query-tuning/actors.csv' AS line MATCH (m:Movie { title:line.title }) MERGE (p:Person { name:line.name }) ON CREATE SET p.born = toInt(line.born) MERGE (p)-[:ACTED_IN { roles:split(line.roles,";")}]->(m)
LOAD CSV WITH HEADERS FROM 'http://neo4j.com/docs/3.1.0-SNAPSHOT/csv/query-tuning/directors.csv' AS line MATCH (m:Movie { title:line.title }) MERGE (p:Person { name:line.name }) ON CREATE SET p.born = toInt(line.born) MERGE (p)-[:DIRECTED]->(m)
Let’s say we want to write a query to find Tom Hanks. The naive way of doing this would be to write the following:
MATCH (p { name:"Tom Hanks" }) RETURN p
This query will find the Tom Hanks node but as the number of nodes in the database increase it will become slower and slower. We can profile the query to find out why that is.
You can learn more about the options for profiling queries in Section 15.2, “How do I profile a query?” but in this case we’re going to prefix our query with PROFILE
:
PROFILE MATCH (p { name:"Tom Hanks" }) RETURN p
Compiler CYPHER 3.1 Planner COST Runtime INTERPRETED +-----------------+----------------+------+---------+-----------+---------------------------+ | Operator | Estimated Rows | Rows | DB Hits | Variables | Other | +-----------------+----------------+------+---------+-----------+---------------------------+ | +ProduceResults | 16 | 1 | 0 | p | p | | | +----------------+------+---------+-----------+---------------------------+ | +Filter | 16 | 1 | 163 | p | p.name == { AUTOSTRING0} | | | +----------------+------+---------+-----------+---------------------------+ | +AllNodesScan | 163 | 163 | 164 | p | | +-----------------+----------------+------+---------+-----------+---------------------------+ Total database accesses: 327
The first thing to keep in mind when reading execution plans is that you need to read from the bottom up.
In that vein, starting from the last row, the first thing we notice is that the value in the Rows
column seems high given there is only one node with the name property Tom Hanks
in the database.
If we look across to the Operator
column we’ll see that AllNodesScan has been used which means that the query planner scanned through all the nodes in the database.
Moving up to the previous row we see the Filter operator which will check the name
property on each of the nodes passed through by AllNodesScan
.
This seems like an inefficient way of finding Tom Hanks
given that we are looking at many nodes that aren’t even people and therefore aren’t what we’re looking for.
The solution to this problem is that whenever we’re looking for a node we should specify a label to help the query planner narrow down the search space.
For this query we’d need to add a Person
label.
MATCH (p:Person { name:"Tom Hanks" }) RETURN p
This query will be faster than the first one but as the number of people in our database increase we again notice that the query slows down.
Again we can profile the query to work out why:
PROFILE MATCH (p:Person { name:"Tom Hanks" }) RETURN p
Compiler CYPHER 3.1 Planner COST Runtime INTERPRETED +------------------+----------------+------+---------+-----------+---------------------------+ | Operator | Estimated Rows | Rows | DB Hits | Variables | Other | +------------------+----------------+------+---------+-----------+---------------------------+ | +ProduceResults | 13 | 1 | 0 | p | p | | | +----------------+------+---------+-----------+---------------------------+ | +Filter | 13 | 1 | 125 | p | p.name == { AUTOSTRING0} | | | +----------------+------+---------+-----------+---------------------------+ | +NodeByLabelScan | 125 | 125 | 126 | p | :Person | +------------------+----------------+------+---------+-----------+---------------------------+ Total database accesses: 251
This time the Rows
value on the last row has reduced so we’re not scanning some nodes that we were before which is a good start.
The NodeByLabelScan operator indicates that we achieved this by first doing a linear scan of all the Person
nodes in the database.
Once we’ve done that we again scan through all those nodes using the Filter
operator, comparing the name property of each one.
This might be acceptable in some cases but if we’re going to be looking up people by name frequently then we’ll see better performance if we create an index on the name
property for the Person
label:
CREATE INDEX ON :Person(name)
Now if we run the query again it will run more quickly:
MATCH (p:Person { name:"Tom Hanks" }) RETURN p
Let’s profile the query to see why that is:
PROFILE MATCH (p:Person { name:"Tom Hanks" }) RETURN p
Compiler CYPHER 3.1 Planner COST Runtime INTERPRETED +-----------------+----------------+------+---------+-----------+---------------+ | Operator | Estimated Rows | Rows | DB Hits | Variables | Other | +-----------------+----------------+------+---------+-----------+---------------+ | +ProduceResults | 1 | 1 | 0 | p | p | | | +----------------+------+---------+-----------+---------------+ | +NodeIndexSeek | 1 | 1 | 2 | p | :Person(name) | +-----------------+----------------+------+---------+-----------+---------------+ Total database accesses: 2
Our execution plan is down to a single row and uses the Node Index Seek operator which does a schema index seek (see Section 14.1, “Indexes”) to find the appropriate node.