Let’s look at a few examples. We’ll use a data set containing movies, actors and roles.
Tip While you’ll usually want to store your node identifier as a property on the node for looking it up later, it’s not mandatory.
If you don’t want the identifier to be persisted then don’t specify a property name in the :ID field.
Basic example
First we’ll look at the movies.
Each movie has an id, which is used to refer to it in other data sources, a title and a year.
Along with these properties we’ll also add the node labels Movie and Sequel.
By default the import tool expects CSV files to be comma delimited.
movies.csv
movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
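To make the header row concrete, here is a minimal Python sketch of how a header like the one in movies.csv could be split into property names and type annotations. The function name parse_header and the fallback to string for fields without a type are assumptions made for illustration; this is not the import tool's own parser.

```python
def parse_header(header_line, delimiter=","):
    """Split a header line into (name, kind) pairs.

    'movieId:ID' becomes ('movieId', 'ID') and ':LABEL' becomes ('', 'LABEL').
    """
    fields = []
    for field in header_line.split(delimiter):
        name, separator, kind = field.partition(":")
        # Assumption: a field without an explicit type is treated as a string.
        fields.append((name, kind if separator else "string"))
    return fields


header = parse_header("movieId:ID,title,year:int,:LABEL")
for name, kind in header:
    print(repr(name), kind)
```

The empty name in front of LABEL reflects that the field carries labels rather than a named property.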
Next up are the actors. They have an id - in this case a shorthand - and a name and all have the Actor label.
actors.csv
personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
Finally we have the roles that an actor plays in a movie, which will be represented by relationships in the database.
In order to create a relationship between nodes we refer to the ids used in actors.csv and movies.csv in the :START_ID and :END_ID fields.
We also need to provide a relationship type (in this case ACTED_IN) in the :TYPE field.
roles.csv
:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN
With all data in place, we execute the following command:
neo4j-import --into path_to_target_directory --nodes movies.csv --nodes actors.csv --relationships roles.csv
We’re now ready to start up a database from the target directory (see Section 23.2, “Server Installation”).
Once we’ve got the database up and running we can add appropriate indexes (see Section 3.6, “Labels, Constraints and Indexes”).
Tip It is possible to import only nodes using the import tool - just don’t specify a relationships file when calling neo4j-import.
Customizing configuration options
We can customize the configuration options that the import tool uses (see the section called “Options”) if our data doesn’t fit the default format.
The following CSV files are delimited by ; and use | as their array delimiter and ' for quotes.
movies2.csv
movieId:ID;title;year:int;:LABEL
tt0133093;'The Matrix';1999;Movie
tt0234215;'The Matrix Reloaded';2003;Movie|Sequel
tt0242653;'The Matrix Revolutions';2003;Movie|Sequel
actors2.csv
personId:ID;name;:LABEL
keanu;'Keanu Reeves';Actor
laurence;'Laurence Fishburne';Actor
carrieanne;'Carrie-Anne Moss';Actor
roles2.csv
:START_ID;role;:END_ID;:TYPE
keanu;'Neo';tt0133093;ACTED_IN
keanu;'Neo';tt0234215;ACTED_IN
keanu;'Neo';tt0242653;ACTED_IN
laurence;'Morpheus';tt0133093;ACTED_IN
laurence;'Morpheus';tt0234215;ACTED_IN
laurence;'Morpheus';tt0242653;ACTED_IN
carrieanne;'Trinity';tt0133093;ACTED_IN
carrieanne;'Trinity';tt0234215;ACTED_IN
carrieanne;'Trinity';tt0242653;ACTED_IN
We can then import these files with the following command line options:
neo4j-import --into path_to_target_directory --nodes movies2.csv --nodes actors2.csv --relationships roles2.csv --delimiter ";" --array-delimiter "|" --quote "'"
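To see how the three options relate to the data, the following Python sketch reads the movies2.csv content with the standard csv module, using the same field delimiter, quote character and array delimiter. It only illustrates what each option means for parsing and makes no claim about how the import tool reads files internally.

```python
import csv
import io

MOVIES2 = """movieId:ID;title;year:int;:LABEL
tt0133093;'The Matrix';1999;Movie
tt0234215;'The Matrix Reloaded';2003;Movie|Sequel
tt0242653;'The Matrix Revolutions';2003;Movie|Sequel
"""

# delimiter corresponds to --delimiter, quotechar to --quote.
reader = csv.reader(io.StringIO(MOVIES2), delimiter=";", quotechar="'")
header = next(reader)

movies = []
for movie_id, title, year, labels in reader:
    movies.append({
        "id": movie_id,
        "title": title,
        "year": int(year),
        # The array delimiter (--array-delimiter) separates multiple labels.
        "labels": labels.split("|"),
    })

for movie in movies:
    print(movie)
```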
Using separate header files
When dealing with very large CSV files it’s more convenient to have the header in a separate file. This makes it easier to edit the header as you avoid having to open a huge data file just to change it.
We’ll use the same data as in the previous example but put the headers in separate files.
movies3-header.csv
movieId:ID,title,year:int,:LABEL
movies3.csv
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
actors3-header.csv
personId:ID,name,:LABEL
actors3.csv
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
roles3-header.csv
:START_ID,role,:END_ID,:TYPE
roles3.csv
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN
Note how the file groups are enclosed in quotation marks in the command:
neo4j-import --into path_to_target_directory --nodes "movies3-header.csv,movies3.csv" --nodes "actors3-header.csv,actors3.csv" --relationships "roles3-header.csv,roles3.csv"
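As a rough illustration of what a separate header file means, this Python sketch takes the field names from one source and applies them to the rows of another. The in-memory strings stand in for movies3-header.csv and movies3.csv; the approach is an assumption for demonstration purposes, not a description of the import tool's implementation.

```python
import csv
import io

# Stand-ins for the contents of movies3-header.csv and movies3.csv.
HEADER_FILE = "movieId:ID,title,year:int,:LABEL\n"
DATA_FILE = """tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
"""

# The header file contributes only the field names...
fieldnames = next(csv.reader(io.StringIO(HEADER_FILE)))

# ...which are then applied to every row of the data file.
rows = list(csv.DictReader(io.StringIO(DATA_FILE), fieldnames=fieldnames))

for row in rows:
    print(row["movieId:ID"], row["title"], row[":LABEL"])
```

Editing the header then only requires touching the small header source, while the data stays unchanged.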
Multiple input files
As well as using a separate header file you can also provide multiple nodes or relationships files.
This may be useful when processing the output from a Hadoop pipeline for example.
Files within such an input group can be specified with multiple match strings, delimited by a comma (,), where each match string is either an exact file name or a regular expression matching one or more files.
Multiple matching files are sorted by their characters, with natural number ordering applied to file names that contain numbers.
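The difference between plain character ordering and natural number ordering matters once there are ten or more parts. The Python sketch below shows one common way to implement a natural sort key and to select files with a regular expression. The file names with part10 and the helper natural_key are hypothetical, and the import tool's exact ordering rules may differ in edge cases.

```python
import re


def natural_key(file_name):
    """Split a name into text and digit runs so digit runs compare as integers."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", file_name)]


# Hypothetical directory listing.
available = [
    "movies4-part10.csv",
    "movies4-part2.csv",
    "movies4-header.csv",
    "movies4-part1.csv",
]

# A match string given as a regular expression selecting the data files.
pattern = re.compile(r"movies4-part\d+\.csv")
matching = [name for name in available if pattern.fullmatch(name)]

plain_order = sorted(matching)
natural_order = sorted(matching, key=natural_key)

print(plain_order)
print(natural_order)
```

With plain character ordering part10 would come before part2; the natural key keeps the parts in numeric order.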
movies4-header.csv
movieId:ID,title,year:int,:LABEL
movies4-part1.csv
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
movies4-part2.csv
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
actors4-header.csv
personId:ID,name,:LABEL
actors4-part1.csv
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
actors4-part2.csv
carrieanne,"Carrie-Anne Moss",Actor
roles4-header.csv
:START_ID,role,:END_ID,:TYPE
roles4-part1.csv
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
roles4-part2.csv
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN
The call to neo4j-import would look like this:
neo4j-import --into path_to_target_directory --nodes "movies4-header.csv,movies4-part1.csv,movies4-part2.csv" --nodes "actors4-header.csv,actors4-part1.csv,actors4-part2.csv" --relationships "roles4-header.csv,roles4-part1.csv,roles4-part2.csv"
Types and labels
Using the same label for every node
If you want to use the same node label(s) for every node in your nodes file you can do this by specifying the appropriate value as an option to neo4j-import.
In this example we’ll put the label Movie on every node specified in movies5.csv:
movies5.csv
movieId:ID,title,year:int
tt0133093,"The Matrix",1999
Tip There’s then no need to specify the :LABEL field in the node file.
In this case, we’ll put the labels Movie and Sequel on the nodes specified in sequels5.csv.
sequels5.csv
movieId:ID,title,year:int
tt0234215,"The Matrix Reloaded",2003
tt0242653,"The Matrix Revolutions",2003
actors5.csv
personId:ID,name
keanu,"Keanu Reeves"
laurence,"Laurence Fishburne"
carrieanne,"Carrie-Anne Moss"
roles5.csv
:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN
The call to neo4j-import would look like this:
neo4j-import --into path_to_target_directory --nodes:Movie movies5.csv --nodes:Movie:Sequel sequels5.csv --nodes:Actor actors5.csv --relationships roles5.csv
Using the same relationship type for every relationship
If you want to use the same relationship type for every relationship in your relationships file you can do this by specifying the appropriate value as an option to neo4j-import.
In this example we’ll put the relationship type ACTED_IN on every relationship specified in roles6.csv:
movies6.csv
movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
actors6.csv
personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
roles6.csv
:START_ID,role,:END_ID
keanu,"Neo",tt0133093
keanu,"Neo",tt0234215
keanu,"Neo",tt0242653
laurence,"Morpheus",tt0133093
laurence,"Morpheus",tt0234215
laurence,"Morpheus",tt0242653
carrieanne,"Trinity",tt0133093
carrieanne,"Trinity",tt0234215
carrieanne,"Trinity",tt0242653
Tip If you provide a relationship type on the command line and in the relationships file, the one in the file will be applied.
The call to neo4j-import would look like this:
neo4j-import --into path_to_target_directory --nodes movies6.csv --nodes actors6.csv --relationships:ACTED_IN roles6.csv
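The precedence between a type given in the file and one given on the command line can be written down in a few lines of Python. The helper effective_type is hypothetical and only restates the rule from the tip; it is not taken from the import tool.

```python
def effective_type(file_type, command_line_type):
    """Return the relationship type that ends up on the relationship.

    A type present in the relationships file wins; the command line
    value acts as the fallback for rows without one.
    """
    return file_type if file_type else command_line_type


# roles6.csv has no :TYPE field, so the command line value is used.
from_command_line = effective_type(None, "ACTED_IN")

# Hypothetical row that carries its own type (DIRECTED is made up for illustration).
from_file = effective_type("DIRECTED", "ACTED_IN")

print(from_command_line, from_file)
```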
Property types
The type for properties specified in nodes and relationships files is defined in the header row (see Section 29.1, “CSV file header format”).
The following example creates a small graph containing one actor and one movie connected by an ACTED_IN relationship.
There is a roles property on the relationship which contains an array of the characters played by the actor in the movie.
movies7.csv
movieId:ID,title,year:int,:LABEL
tt0099892,"Joe Versus the Volcano",1990,Movie
actors7.csv
personId:ID,name,:LABEL
meg,"Meg Ryan",Actor
roles7.csv
:START_ID,roles:string[],:END_ID,:TYPE
meg,"DeDe;Angelica Graynamore;Patricia Graynamore",tt0099892,ACTED_IN
The arguments to neo4j-import would be the following:
neo4j-import --into path_to_target_directory --nodes movies7.csv --nodes actors7.csv --relationships roles7.csv
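To illustrate what the string[] type means for the roles value, here is a small Python sketch that splits a field on the array delimiter and converts each element. The helper parse_array is hypothetical, and the second call with integers is only an illustration of element conversion, not data from the example above.

```python
def parse_array(value, element_type=str, array_delimiter=";"):
    """Split one CSV field into a list and convert every element."""
    return [element_type(item) for item in value.split(array_delimiter)]


# The roles field from roles7.csv, after CSV unquoting.
roles = parse_array("DeDe;Angelica Graynamore;Patricia Graynamore")

# Illustrative only: the same idea applied to integer elements.
years = parse_array("1999;2003", element_type=int)

print(roles)
print(years)
```

Note that the whole array sits inside one quoted CSV field, so the array delimiter does not have to differ from characters used in other fields, only from the field delimiter.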
ID handling
Each node processed by neo4j-import must provide a unique id.
We use this id to find the correct nodes when creating relationships.
Working with sequential or auto incrementing identifiers
The import tool makes the assumption that identifiers are unique across node files. This may not be the case for data sets which use sequential, auto incremented or otherwise colliding identifiers. Those data sets can define id spaces where identifiers are unique within their respective id space.
For example, if movies and people both use sequential identifiers then we would define Movie and Actor id spaces.
movies8.csv
movieId:ID(Movie),title,year:int,:LABEL
1,"The Matrix",1999,Movie
2,"The Matrix Reloaded",2003,Movie;Sequel
3,"The Matrix Revolutions",2003,Movie;Sequel
actors8.csv
personId:ID(Actor),name,:LABEL
1,"Keanu Reeves",Actor
2,"Laurence Fishburne",Actor
3,"Carrie-Anne Moss",Actor
We also need to reference the appropriate id space in our relationships file so it knows which nodes to connect together:
roles8.csv
:START_ID(Actor),role,:END_ID(Movie)
1,"Neo",1
1,"Neo",2
1,"Neo",3
2,"Morpheus",1
2,"Morpheus",2
2,"Morpheus",3
3,"Trinity",1
3,"Trinity",2
3,"Trinity",3
Id spaces need no extra command line arguments; because roles8.csv has no :TYPE field, the relationship type is given on the command line:
neo4j-import --into path_to_target_directory --nodes movies8.csv --nodes actors8.csv --relationships:ACTED_IN roles8.csv
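One way to picture id spaces is as a lookup keyed by the pair of id space and id, so that id 1 in Movie and id 1 in Actor never collide. The Python sketch below models that idea with a dictionary; the function and variable names are hypothetical and the import tool's internal data structures are not described by the source.

```python
nodes = {}


def add_node(id_space, node_id, properties):
    """Register a node under the key (id space, id)."""
    key = (id_space, node_id)
    if key in nodes:
        raise ValueError(f"Id {node_id!r} is defined more than once in {id_space}")
    nodes[key] = properties


# The same id "1" is fine because the id spaces differ.
add_node("Movie", "1", {"title": "The Matrix"})
add_node("Actor", "1", {"name": "Keanu Reeves"})

# A row from roles8.csv: :START_ID(Actor)=1, role=Neo, :END_ID(Movie)=1
start_node = nodes[("Actor", "1")]
end_node = nodes[("Movie", "1")]

print(start_node["name"], "ACTED_IN", end_node["title"])
```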
Bad input data
The import tool has a threshold for how many bad entities (nodes/relationships) it will tolerate and skip before failing the import.
By default 1000 bad entities are tolerated.
A bad tolerance of 0, for example, will fail the import on the first bad entity.
For more information, see the --bad-tolerance option.
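The tolerance can be thought of as a counter with a limit. The following Python sketch models that behaviour under the assumption that the import fails as soon as the number of bad entities exceeds the tolerance, which matches the statement that a tolerance of 0 fails on the first bad entity. The class name BadEntityCollector is hypothetical.

```python
class BadEntityCollector:
    """Collects bad entities and fails once the tolerance is exceeded."""

    def __init__(self, tolerance=1000):
        self.tolerance = tolerance
        self.bad = []

    def report(self, description):
        self.bad.append(description)
        # Assumption: fail as soon as the count exceeds the tolerance,
        # so a tolerance of 0 fails on the very first bad entity.
        if len(self.bad) > self.tolerance:
            raise RuntimeError(
                f"Too many bad entities: {len(self.bad)} > {self.tolerance}"
            )


lenient = BadEntityCollector(tolerance=1)
lenient.report("relationship refers to missing node emil")  # tolerated

strict = BadEntityCollector(tolerance=0)
try:
    strict.report("relationship refers to missing node emil")
    failed = False
except RuntimeError:
    failed = True

print(len(lenient.bad), failed)
```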
There are different types of bad input, which we will look into.
Relationships referring to missing nodes
Relationships that refer to missing node ids, either for :START_ID or :END_ID, are considered bad relationships.
Whether or not such relationships are skipped is controlled with the --skip-bad-relationships flag, which can have the values true or false, or no value, which means true.
Specifying false means that any bad relationship is considered an error and will fail the import.
For more information, see the --skip-bad-relationships option.
In the following example there is a missing emil node referenced in the roles file.
movies9.csv
movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
actors9.csv
personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
roles9.csv
:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN
emil,"Emil",tt0133093,ACTED_IN
The command line arguments would remain the same as before:
neo4j-import --into path_to_target_directory --nodes movies9.csv --nodes actors9.csv --relationships roles9.csv
Since there was only one bad relationship the import process will complete successfully and a not-imported.bad file will be created and populated with the bad relationships.
not-imported.bad
InputRelationship:
   source: roles9.csv:11
   properties: [role, Emil]
   startNode: emil
   endNode: tt0133093
   type: ACTED_IN
 refering to missing node emil
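The check behind a bad relationship can be sketched in Python as a membership test of both endpoints against the set of known node ids. This is an illustration using a subset of the example data; the variable names are hypothetical and the import tool's actual algorithm is not described by the source.

```python
# Node ids from movies9.csv and actors9.csv.
node_ids = {
    "tt0133093", "tt0234215", "tt0242653",
    "keanu", "laurence", "carrieanne",
}

# A subset of roles9.csv as (start id, role, end id, type).
relationships = [
    ("keanu", "Neo", "tt0133093", "ACTED_IN"),
    ("carrieanne", "Trinity", "tt0242653", "ACTED_IN"),
    ("emil", "Emil", "tt0133093", "ACTED_IN"),
]

good, bad = [], []
for relationship in relationships:
    start_id, _role, end_id, _type = relationship
    # A relationship is bad if either endpoint is not a known node id.
    missing = [node_id for node_id in (start_id, end_id) if node_id not in node_ids]
    if missing:
        bad.append((relationship, missing))
    else:
        good.append(relationship)

for relationship, missing in bad:
    print(relationship, "refers to missing node(s)", missing)
```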
Multiple nodes with same id within same id space
Nodes that specify an :ID which has already been specified within the id space are considered bad nodes.
Whether or not such nodes are skipped is controlled with the --skip-duplicate-nodes flag, which can have the values true or false, or no value, which means true.
Specifying false means that any duplicate node is considered an error and will fail the import.
For more information, see the --skip-duplicate-nodes option.
In the following example there is a node id that is specified twice within the same id space.
actors10.csv
personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
laurence,"Laurence Harvey",Actor
neo4j-import --into path_to_target_directory --nodes actors10.csv --skip-duplicate-nodes
Since there was only one bad node the import process will complete successfully and a not-imported.bad file will be created and populated with the bad node.
not-imported.bad
Id 'laurence' is defined more than once in global id space, at least at actors10.csv:3 and actors10.csv:5
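The line numbers in the message count the header as line 1. A minimal Python sketch of duplicate detection that tracks where each id was first seen could look like this; it is an illustration of the idea, with hypothetical variable names, rather than the import tool's implementation.

```python
import csv
import io

ACTORS10 = """personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
laurence,"Laurence Harvey",Actor
"""

reader = csv.reader(io.StringIO(ACTORS10))
next(reader)  # skip the header, which is line 1 of the file

first_seen = {}   # node id -> line number of its first occurrence
duplicates = []   # (node id, first line, duplicate line)

# Data rows start on line 2.
for line_number, row in enumerate(reader, start=2):
    node_id = row[0]
    if node_id in first_seen:
        duplicates.append((node_id, first_seen[node_id], line_number))
    else:
        first_seen[node_id] = line_number

for node_id, first_line, duplicate_line in duplicates:
    print(f"Id '{node_id}' is defined more than once, "
          f"at least at actors10.csv:{first_line} and actors10.csv:{duplicate_line}")
```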