If you have used SQL
and want to learn Cypher, this chapter is for you!
We won’t dig very deep into either of the languages, but focus on bridging the gap.
Data Model
For our example, we will use data about persons who act in, direct, produce movies.
Here’s an entity-relationship model for the example:
We have Person
and Movie
entities, which are related in three different ways, each of which have many-to-many cardinality.
In a RDBMS we would use tables for the entities as well as for the associative entities (join tables) needed.
In this case we decided to go with the following tables:
movie
, person
, acted_in
, directed
, produced
.
You’ll find the SQL for this below.
In Neo4j, the basic data units are nodes and relationships. Both can have properties, which correspond to attributes in a RDBMS.
Nodes can be grouped by putting labels on them.
In the example, we will use the labels Movie
and Person
.
When using Neo4j, related entities can be represented directly by using relationships.
There’s no need to deal with foreign keys to handle the relationships, the database will take care of such mechanics.
Also, the relationships always have full referential integrity.
There are no constraints to enable for this, as it’s not optional; it’s really part of the underlying data model.
Relationships always have a type, and we will differentiate the different kinds of relationships by using the types ACTED_IN
, DIRECTED
, PRODUCED
.
Sample Data
First off, let’s see how to set up our example data in a RDBMS. We’ll start out creating a few tables and then go on to populate them.
CREATE TABLE movie ( id INTEGER, title VARCHAR(100), released INTEGER, tagline VARCHAR(100) ); CREATE TABLE person ( id INTEGER, name VARCHAR(100), born INTEGER ); CREATE TABLE acted_in ( role varchar(100), person_id INTEGER, movie_id INTEGER ); CREATE TABLE directed ( person_id INTEGER, movie_id INTEGER ); CREATE TABLE produced ( person_id INTEGER, movie_id INTEGER );
Populating with data:
INSERT INTO movie (id, title, released, tagline) VALUES ( (1, 'The Matrix', 1999, 'Welcome to the Real World'), (2, 'The Devil''s Advocate', 1997, 'Evil has its winning ways'), (3, 'Monster', 2003, 'The first female serial killer of America') ); INSERT INTO person (id, name, born) VALUES ( (1, 'Keanu Reeves', 1964), (2, 'Carrie-Anne Moss', 1967), (3, 'Laurence Fishburne', 1961), (4, 'Hugo Weaving', 1960), (5, 'Andy Wachowski', 1967), (6, 'Lana Wachowski', 1965), (7, 'Joel Silver', 1952), (8, 'Charlize Theron', 1975), (9, 'Al Pacino', 1940), (10, 'Taylor Hackford', 1944) ); INSERT INTO acted_in (role, person_id, movie_id) VALUES ( ('Neo', 1, 1), ('Trinity', 2, 1), ('Morpheus', 3, 1), ('Agent Smith', 4, 1), ('Kevin Lomax', 1, 2), ('Mary Ann Lomax', 8, 2), ('John Milton', 9, 2), ('Aileen', 8, 3) ); INSERT INTO directed (person_id, movie_id) VALUES ( (5, 1), (6, 1), (10, 2) ); INSERT INTO produced (person_id, movie_id) VALUES ( (7, 1), (8, 3) );
Doing this in Neo4j will look quite different. To begin with, we won’t create any schema up front. We’ll come back to schema later, for now it’s enough to know that labels can be used right away without declaring them.
In the CREATE
statements below, we tell Neo4j what data we want to have in the graph.
Simply put, the parentheses denote nodes, while the arrows (-->
, or in our case with a relationship type included -[:DIRECTED]->
) denote relationships.
For the nodes we set variables like TheMatrix
so we can easily refer to them later on in the statement.
Note that the variables are scoped to the statement, and not visible to other Cypher statements.
We could use variables for the relationships as well, but there’s no need for that in this case.
CREATE (TheMatrix:Movie { title:'The Matrix', released:1999, tagline:'Welcome to the Real World' }) CREATE (Keanu:Person { name:'Keanu Reeves', born:1964 }) CREATE (Carrie:Person { name:'Carrie-Anne Moss', born:1967 }) CREATE (Laurence:Person { name:'Laurence Fishburne', born:1961 }) CREATE (Hugo:Person { name:'Hugo Weaving', born:1960 }) CREATE (AndyW:Person { name:'Andy Wachowski', born:1967 }) CREATE (LanaW:Person { name:'Lana Wachowski', born:1965 }) CREATE (JoelS:Person { name:'Joel Silver', born:1952 }) CREATE (Keanu)-[:ACTED_IN { roles: ['Neo']}]->(TheMatrix), (Carrie)-[:ACTED_IN { roles: ['Trinity']}]->(TheMatrix), (Laurence)-[:ACTED_IN { roles: ['Morpheus']}]->(TheMatrix), (Hugo)-[:ACTED_IN { roles: ['Agent Smith']}]->(TheMatrix),(AndyW)-[:DIRECTED]->(TheMatrix), (LanaW)-[:DIRECTED]->(TheMatrix),(JoelS)-[:PRODUCED]->(TheMatrix) CREATE (TheDevilsAdvocate:Movie { title:"The Devil's Advocate", released:1997, tagline: 'Evil has its winning ways' }) CREATE (Monster:Movie { title: 'Monster', released: 2003, tagline: 'The first female serial killer of America' }) CREATE (Charlize:Person { name:'Charlize Theron', born:1975 }) CREATE (Al:Person { name:'Al Pacino', born:1940 }) CREATE (Taylor:Person { name:'Taylor Hackford', born:1944 }) CREATE (Keanu)-[:ACTED_IN { roles: ['Kevin Lomax']}]->(TheDevilsAdvocate), (Charlize)-[:ACTED_IN { roles: ['Mary Ann Lomax']}]->(TheDevilsAdvocate), (Al)-[:ACTED_IN { roles: ['John Milton']}]->(TheDevilsAdvocate), (Taylor)-[:DIRECTED]->(TheDevilsAdvocate),(Charlize)-[:ACTED_IN { roles: ['Aileen']}]->(Monster), (Charlize)-[:PRODUCED { roles: ['Aileen']}]->(Monster)
Simple read of data
Let’s find all entries in the movie
table and output their title
attribute in our RDBMS:
SELECT movie.title FROM movie;
TITLE |
---|
3 rows |
|
|
|
Using Neo4j, find all nodes labeled Movie
and output their title
property:
MATCH (movie:Movie) RETURN movie.title;
movie.title |
---|
3 rows |
|
|
|
MATCH
tells Neo4j to match a pattern in the graph.
In this case the pattern is very simple: any node with a Movie
label on it.
We bind the result of the pattern matching to the variable movie
, for use in the RETURN
clause.
And as you can see, the RETURN
keyword of Cypher is similar to SELECT
in SQL.
Now let’s get movies released after 1998.
SELECT movie.title FROM movie WHERE movie.released > 1998;
TITLE |
---|
2 rows |
|
|
In this case the addition actually looks identical in Cypher.
MATCH (movie:Movie) WHERE movie.released > 1998 RETURN movie.title;
movie.title |
---|
2 rows |
|
|
Note however that the semantics of WHERE
in Cypher is somewhat different, see Section 11.3, “Where” for more information.
Join
Let’s list all persons and the movies they acted in.
SELECT person.name, movie.title FROM person JOIN acted_in AS acted_in ON acted_in.person_id = person.id JOIN movie ON acted_in.movie_id = movie.id;
NAME | TITLE |
---|---|
8 rows | |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The same using Cypher:
MATCH (person:Person)-[:ACTED_IN]->(movie:Movie) RETURN person.name, movie.title;
Here we match a Person
and a Movie
node, in case they are connected with an ACTED_IN
relationship.
person.name | movie.title |
---|---|
8 rows | |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
To make things slightly more complex, let’s search for the co-actors of Keanu Reeves.
In SQL we use a self join on the person
table and join on the acted_in
table once for Keanu, and once for the co-actors.
SELECT DISTINCT co_actor.name FROM person AS keanu JOIN acted_in AS acted_in1 ON acted_in1.person_id = keanu.id JOIN acted_in AS acted_in2 ON acted_in2.movie_id = acted_in1.movie_id JOIN person AS co_actor ON acted_in2.person_id = co_actor.id AND co_actor.id <> keanu.id WHERE keanu.name = 'Keanu Reeves';
NAME |
---|
5 rows |
|
|
|
|
|
In Cypher, we use a pattern with two paths that target the same Movie
node.
MATCH (keanu:Person)-[:ACTED_IN]->(movie:Movie),(coActor:Person)-[:ACTED_IN]->(movie) WHERE keanu.name = 'Keanu Reeves' RETURN DISTINCT coActor.name;
You may have noticed that we used the co_actor.id <> keanu.id
predicate in SQL only.
This is because Neo4j will only match on the ACTED_IN
relationship once in the same pattern.
If this is not what we want, we can split the pattern up by using two MATCH
clauses like this:
MATCH (keanu:Person)-[:ACTED_IN]->(movie:Movie) MATCH (coActor:Person)-[:ACTED_IN]->(movie) WHERE keanu.name = 'Keanu Reeves' RETURN DISTINCT coActor.name;
This time Keanu Reeves is included in the result as well:
coActor.name |
---|
6 rows |
|
|
|
|
|
|
Next, let’s find out who has both acted in and produced movies.
SELECT person.name FROM person WHERE person.id IN (SELECT person_id FROM acted_in) AND person.id IN (SELECT person_id FROM produced)
NAME |
---|
1 rows |
|
In Cypher, we use patterns as predicates in this case. That is, we require the relationships to exist, but don’t care about the connected nodes; thus the empty parentheses.
MATCH (person:Person) WHERE (person)-[:ACTED_IN]->() AND (person)-[:PRODUCED]->() RETURN person.name
Aggregation
Now let’s find out a bit about the directors in movies that Keanu Reeves acted in. We want to know how many of those movies each of them directed.
SELECT director.name, count(*) FROM person keanu JOIN acted_in ON keanu.id = acted_in.person_id JOIN directed ON acted_in.movie_id = directed.movie_id JOIN person AS director ON directed.person_id = director.id WHERE keanu.name = 'Keanu Reeves' GROUP BY director.name ORDER BY count(*) DESC
NAME | C2 |
---|---|
3 rows | |
|
|
|
|
|
|
Here’s how we’ll do the same in Cypher:
MATCH (keanu:Person { name: 'Keanu Reeves' })-[:ACTED_IN]->(movie:Movie), (director:Person)-[:DIRECTED]->(movie) RETURN director.name, count(*) ORDER BY count(*) DESC
As you can see there is no GROUP BY
in the Cypher equivalent.
Instead, Neo4j will automatically figure out the grouping key.