Query Performance in Neo4j

[ databases neo4j easi ]

I was going through the Cypher Manual today and started playing around with EXPLAIN and PROFILE to learn more about how Neo4j formulates execution plans. The most important piece of advice to extract is: Do not be lazy while writing your queries!

This means removing as much ambiguity and guesswork from your query as possible:

If you know the node’s label, include it
- Wrong: MATCH (p {name: "Romeo"}) RETURN p
- Right: MATCH (p:Person {name: "Romeo"}) RETURN p
If you know the relationship type, include it
- Wrong: MATCH (p {name: "Romeo"}) --> (m) RETURN m
- Right: MATCH (p:Person {name: "Romeo"}) -[:LIKES]-> (m:Movie) RETURN m
If it’s an important, highly-used query – put an index on it
- e.g., CREATE INDEX ON :Person(name)
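
Since EXPLAIN and PROFILE show up throughout this post, here is a minimal sketch of the difference, using the Romeo query from the list above. EXPLAIN only asks the planner for its plan and does not run the query; PROFILE runs the query and reports the actual rows and db hits per operator. The label and property are just the ones from this post's examples; run each statement separately.

```cypher
// Show the planned operators without executing the query
EXPLAIN
MATCH (p:Person {name: "Romeo"})
RETURN p;

// Execute the query and report actual rows and db hits per operator
PROFILE
MATCH (p:Person {name: "Romeo"})
RETURN p;
```

EXPLAIN is the safer choice for a query that writes data or might run for a long time, since nothing is executed.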

Playing around with Neo4j Browser on a small data set just for fun? Then sure – it might feel good to have a dirty rum martini, get behind the Cypher wheel, and query a bit sloppily (what’s a few measly milliseconds anyway?!). But best ye be puttin’ that drink down, lest ye beget bad habits: the same sloppiness on large data sets in production can make your product feel sluggish and painful.

You might be asking right about now: “Wtf is a dirty rum martini?”

Never you mind!

An Example

Honestly, the Cypher Manual has a great example – and for my own ease of review, much of it has been liberated from there and put here, though with various adjustments and additional experiments.

Side Note: I found that my query profiles differed from those in the Cypher Manual. My hypothesis is that this is due to using a different version or edition of Neo4j or Cypher (e.g., Community vs Enterprise, or just a more up-to-date version in general).

Get the Data

// Movies
LOAD CSV WITH HEADERS FROM 'https://neo4j.com/docs/cypher-manual/3.5/csv/query-tuning/movies.csv' AS line
MERGE (m:Movie { title: line.title })
ON CREATE SET m.released = toInteger(line.released), m.tagline = line.tagline

// Actors
LOAD CSV WITH HEADERS FROM 'https://neo4j.com/docs/cypher-manual/3.5/csv/query-tuning/actors.csv' AS line
MATCH (m:Movie { title: line.title })
MERGE (p:Person { name: line.name })
ON CREATE SET p.born = toInteger(line.born)
MERGE (p)-[:ACTED_IN { roles:split(line.roles, ';')}]->(m)

// Directors
LOAD CSV WITH HEADERS FROM 'https://neo4j.com/docs/cypher-manual/3.5/csv/query-tuning/directors.csv' AS line
MATCH (m:Movie { title: line.title })
MERGE (p:Person { name: line.name })
ON CREATE SET p.born = toInteger(line.born)
MERGE (p)-[:DIRECTED]->(m)
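
Before profiling anything, it can help to confirm that the load did what you expect. A small sanity-check sketch (the column aliases are arbitrary) that counts nodes per label combination and relationships per type:

```cypher
// Count nodes per label combination
MATCH (n)
RETURN labels(n) AS labels, count(*) AS nodes;

// Count relationships per type
MATCH ()-[r]->()
RETURN type(r) AS type, count(*) AS relationships;
```

If your load matches the one used for this post, the totals should add up to the 173 nodes and 254 relationships mentioned further down.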

Single-Node Query

// Profile lazy node query
PROFILE
MATCH (p {name: "Keanu Reeves"}) 
RETURN p

Using CYPHER 3.4, the COST planner, and COMPILED runtime, this resulted in 347 total db hits in 26 ms:

AllNodesScan: 174 db hits, 9 pagecache hits, 0 pagecache misses, 173 estimated rows, 173 rows
Filter: 173 db hits, 1356 pagecache hits, 0 pagecache misses, 17 estimated rows, 1 row
Produce Results: 0 db hits, 6 pagecache hits, 17 estimated rows, 1 row
Result

// Profile non-lazy node query
PROFILE
MATCH (p:Person {name: "Keanu Reeves"}) 
RETURN p

Using CYPHER 3.4, the COST planner, and COMPILED runtime, this resulted in 267 total db hits in 21 ms:

NodeByLabelScan: 134 db hits, 9 pagecache hits, 0 pagecache misses, 133 estimated rows, 133 rows
Filter: 133 db hits, 1065 pagecache hits, 0 pagecache misses, 13 estimated rows, 1 row
Produce Results: 0 db hits, 8 pagecache hits, 13 estimated rows, 1 row
Result

Shaved a whole 5 ms off! That might not seem like much, but our toy data set only has 173 nodes and 254 relationships.

Can we do better? Yes: add an index.

// Create an index and profile that again
CREATE INDEX ON :Person(name);

PROFILE
MATCH (p:Person {name: "Keanu Reeves"}) 
RETURN p

Using CYPHER 3.4, the COST planner, and COMPILED runtime, this resulted in 2 total db hits in 47 ms:

NodeIndexSeek: 2 db hits, 5 pagecache hits, 1 pagecache miss, 1 estimated row, 1 row
Produce Results: 0 db hits, 5 pagecache hits, 1 pagecache miss, 1 estimated row, 1 row
Result

The time almost doubled, though the number of db hits shrank by more than 133x (267 down to 2). The longer time is likely due to having such a small data set: indexing doesn’t necessarily offer a time advantage at this scale. However, one can imagine that a much larger data set would see dramatic improvements in time as well (will test at end of all this by increasing data set size).

// Drop the index before next example
DROP INDEX ON :Person(name)
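
A note for anyone following along on a newer release: the CREATE INDEX ON and DROP INDEX ON forms used in this post are the Neo4j 3.x syntax. From Neo4j 4.0 onward, indexes are created with the FOR ... ON form and can be given a name. A rough sketch of the equivalent statements, where the index name person_name is an arbitrary choice for illustration:

```cypher
// Neo4j 4.x+ syntax; "person_name" is a made-up index name
CREATE INDEX person_name FOR (p:Person) ON (p.name);

// List indexes and their state (SHOW INDEXES is available from Neo4j 4.2)
SHOW INDEXES;

// Drop the index by name before the next example
DROP INDEX person_name;
```

Checking that the index is online before profiling avoids measuring a query while the index is still being populated.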

Multi-Node Query w/ Relationship

// Profile lazy node-relationship-node query
PROFILE
MATCH (p {name: "Tom Hanks"}) --> (m) 
WHERE m.released > 1994
RETURN m

Using CYPHER 3.4, the COST planner, and SLOTTED runtime, this resulted in 374 total db hits in 3 ms:

AllNodesScan: 174 db hits, 10 pagecache hits, 0 pagecache misses, 173 estimated rows, 173 rows
Filter: 173 db hits, 9 pagecache hits, 0 pagecache misses, 17 estimated rows, 1 row
Expand(All): 14 db hits, 9 pagecache hits, 0 pagecache misses, 25 estimated rows, 13 rows
Filter: 13 db hits, 9 pagecache hits, 0 pagecache misses, 1 estimated row, 10 rows
Produce Results: 0 db hits, 9 pagecache hits, 0 pagecache misses, 1 estimated row, 10 rows
Result

// Profile semi-non-lazy node-relationship-node query
PROFILE
MATCH (p:Person {name: "Tom Hanks"}) -[:ACTED_IN]-> (m) 
WHERE m.released > 1994
RETURN m

Using CYPHER 3.4, the COST planner, and SLOTTED runtime, this resulted in 292 total db hits in 20 ms:

NodeByLabelScan: 134 db hits, 8 pagecache hits, 0 pagecache misses, 133 estimated rows, 133 rows
Filter: 133 db hits, 7 pagecache hits, 0 pagecache misses, 13 estimated rows, 1 row
Expand(All): 13 db hits, 7 pagecache hits, 0 pagecache misses, 17 estimated rows, 12 rows
Filter: 12 db hits, 7 pagecache hits, 0 pagecache misses, 0 estimated rows, 9 rows
Produce Results: 0 db hits, 7 pagecache hits, 0 pagecache misses, 0 estimated rows, 9 rows
Result

// Profile non-lazy node-relationship-node query
//   -- that is, specify both relationship and second node label
PROFILE
MATCH (p:Person {name: "Tom Hanks"}) -[:ACTED_IN]-> (m:Movie) 
WHERE m.released > 1994
RETURN m

Using CYPHER 3.4, the COST planner, and SLOTTED runtime, this resulted in 481 total db hits in 37 ms:

NodeByLabelScan: 41 db hits, 23 pagecache hits, 0 pagecache misses, 40 estimated rows, 40 rows
Filter: 40 db hits, 22 pagecache hits, 0 pagecache misses, 1 estimated row, 31 rows
Expand(All): 154 db hits, 22 pagecache hits, 0 pagecache misses, 5 estimated rows, 123 rows
Filter: 246 db hits, 22 pagecache hits, 0 pagecache misses, 0 estimated rows, 9 rows
Produce Results: 0 db hits, 22 pagecache hits, 0 pagecache misses, 0 estimated rows, 9 rows
Result

It actually seems to have gotten worse… mostly at the Expand(All) and subsequent Filter steps… but why? And would this trend continue onto a larger data set where performance really starts to matter?
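
One possible reading of that last profile: the first operator returns 40 rows, which matches the number of nodes that are not Person nodes (173 total minus 133). So the planner appears to have started from the Movie side, filtered on released, and only then expanded back to people. Adding the Movie label gave it a smaller starting set to choose, but one that fans out to 123 rows before the name filter trims it to 9. If that is what happened, a scan hint should push the planner back to starting from Person. This is an untested sketch, not something profiled for this post:

```cypher
// Ask the planner to start from a label scan on Person
// (assumption: no index exists on :Person(name) at this point)
PROFILE
MATCH (p:Person {name: "Tom Hanks"})-[:ACTED_IN]->(m:Movie)
USING SCAN p:Person
WHERE m.released > 1994
RETURN m
```

If the hinted plan comes back closer to the 292 db hits of the query without the Movie label, that would support the explanation; if not, something else is going on.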

Ok… Well, moving on to indexes…

// Create an index on :Person(name) and profile that again
CREATE INDEX ON :Person(name);

PROFILE
MATCH (p:Person {name: "Tom Hanks"}) -[:ACTED_IN]-> (m:Movie) 
WHERE m.released > 1994
RETURN m

Using CYPHER 3.4, the COST planner, and SLOTTED runtime, this resulted in 39 total db hits in 1 ms:

NodeIndexSeek: 2 db hits, 7 pagecache hits, 0 pagecache misses, 1 estimated row, 1 row
Expand(All): 13 db hits, 7 pagecache hits, 0 pagecache misses, 1 estimated row, 12 rows
Filter: 24 db hits, 7 pagecache hits, 0 pagecache misses, 0 estimated rows, 9 rows
Produce Results: 0 db hits, 7 pagecache hits, 0 pagecache misses, 0 estimated rows, 9 rows
Result

// Drop the index on :Person(name) and create an index on :Movie(released)
DROP INDEX ON :Person(name);

CREATE INDEX ON :Movie(released);

PROFILE
MATCH (p:Person {name: "Tom Hanks"}) -[:ACTED_IN]-> (m:Movie) 
WHERE m.released > 1994
RETURN m

Using CYPHER 3.4, the COST planner, and SLOTTED runtime, this resulted in 433 total db hits in 45 ms:

NodeIndexSeekByRange: 33 db hits, 18 pagecache hits, 1 pagecache miss, 1 estimated row, 31 rows
Expand(All): 154 db hits, 18 pagecache hits, 1 pagecache miss, 5 estimated rows, 123 rows
Filter: 246 db hits, 18 pagecache hits, 1 pagecache miss, 0 estimated rows, 9 rows
Produce Results: 0 db hits, 18 pagecache hits, 10 pagecache misses, 0 estimated rows, 9 rows
Result

// Recreate the index on :Person(name) so that we now have two 
// indexes (:Person(name) and :Movie(released)) 

CREATE INDEX ON :Person(name);

PROFILE
MATCH (p:Person {name: "Tom Hanks"}) -[:ACTED_IN]-> (m:Movie) 
WHERE m.released > 1994
RETURN m

Using CYPHER 3.4, the COST planner, and SLOTTED runtime, this resulted in 379 total db hits in 29 ms:

Two starting points that converge in the next step:
- NodeIndexSeek: 3 db hits, 10 pagecache hits, 1 pagecache miss, 1 estimated row, 1 row
- NodeIndexSeekByRange: 33 db hits, 10 pagecache hits, 0 pagecache misses, 1 estimated row, 31 rows
CartesianProduct: 0 db hits, 10 pagecache hits, 1 pagecache miss, 1 estimated row, 31 rows
Expand(Into): 343 db hits, 10 pagecache hits, 1 pagecache miss, 0 estimated rows, 9 rows
Produce Results: 0 db hits, 10 pagecache hits, 1 pagecache miss, 0 estimated rows, 9 rows
Result

…and adding that second index on :Movie(released) performed worse than having just the single index on :Person(name). Again, not sure why, other than that there is a certain cost to using an index. That cost might not grow as fast as your data set, so for a larger data set maybe 2 indexes would be better! However, that is certainly not true on this small data set.
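
The two-index profile shows the planner using both indexes as starting points and joining them with a CartesianProduct, with most of the db hits landing in Expand(Into). One experiment worth trying (not profiled here, so treat it as a sketch) is an index hint that names :Person(name) as the starting point, to see whether the planner then returns to the cheaper single-index plan shape:

```cypher
// Hint that the planner should start from the :Person(name) index
PROFILE
MATCH (p:Person {name: "Tom Hanks"})-[:ACTED_IN]->(m:Movie)
USING INDEX p:Person(name)
WHERE m.released > 1994
RETURN m
```

A hint does not forbid the planner from also using the :Movie(released) index, so the profile output is the only way to know which plan was actually chosen.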

Moral? Not sure.

Indexes are great – sometimes
- Other times, they can increase the db hits and/or time
- Too many indexes can affect performance
Specifying labels and relationship types is important – mostly
- We did see some performance reduction on a small data set
Not sure how these lessons expand to much larger data sets…

Larger Toy Data Set

So at one point above, I promised to re-do everything on a larger data set… Well, consider that promise broken – for now! This post has gotten too long, and it seems a follow-up post would be better.

So, until next time – ta, ta!

Written on November 28, 2018