29.3. Import tool examples

Let’s look at a few examples. We’ll use a data set containing movies, actors and roles.

[Tip]Tip

While you’ll usually want to store your node identifier as a property on the node for looking it up later, it’s not mandatory. If you don’t want the identifier to be persisted then don’t specify a property name in the :ID field.

Basic example

First we’ll look at the movies. Each movie has an id, which is used to refer to it in other data sources, a title and a year Along with these properties we’ll also add the node labels Movie and Sequel.

By default the import tool expects CSV files to be comma delimited.

movies.csv 

movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel

Next up are the actors. They have an id - in this case a shorthand - and a name and all have the Actor label.

actors.csv 

personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor

Finally we have the roles that an actor plays in a movie which will be represented by relationships in the database. In order to create a relationship between nodes we refer to the ids used in actors.csv and movies.csv in the START_ID and END_ID fields. We also need to provide a relationship type (in this case ACTS_IN) in the :TYPE field.

roles.csv 

:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN

With all data in place, we execute the following command:

neo4j-import --into path_to_target_directory --nodes movies.csv --nodes actors.csv --relationships roles.csv

We’re now ready to start up a database from the target directory. (see Section 23.2, “Server Installation”)

Once we’ve got the database up and running we can add appropriate indexes. (see Section 3.6, “Labels, Constraints and Indexes”.)

[Tip]Tip

It is possible to import only nodes using the import tool - just don’t specify a relationships file when calling neo4j-import. If you do this you’ll need to create relationships later by another method - the import tool only works for initial graph population.

Customizing configuration options

We can customize the configuration options that the import tool uses (see the section called “Options”) if our data doesn’t fit the default format. The following CSV files are delimited by ;, use | as their array delimiter and use ' for quotes.

movies2.csv 

movieId:ID;title;year:int;:LABEL
tt0133093;'The Matrix';1999;Movie
tt0234215;'The Matrix Reloaded';2003;Movie|Sequel
tt0242653;'The Matrix Revolutions';2003;Movie|Sequel

actors2.csv 

personId:ID;name;:LABEL
keanu;'Keanu Reeves';Actor
laurence;'Laurence Fishburne';Actor
carrieanne;'Carrie-Anne Moss';Actor

roles2.csv 

:START_ID;role;:END_ID;:TYPE
keanu;'Neo';tt0133093;ACTED_IN
keanu;'Neo';tt0234215;ACTED_IN
keanu;'Neo';tt0242653;ACTED_IN
laurence;'Morpheus';tt0133093;ACTED_IN
laurence;'Morpheus';tt0234215;ACTED_IN
laurence;'Morpheus';tt0242653;ACTED_IN
carrieanne;'Trinity';tt0133093;ACTED_IN
carrieanne;'Trinity';tt0234215;ACTED_IN
carrieanne;'Trinity';tt0242653;ACTED_IN

We can then import these files with the following command line options:

neo4j-import --into path_to_target_directory --nodes movies2.csv --nodes actors2.csv --relationships roles2.csv --delimiter ";" --array-delimiter "|" --quote "'"

Using separate header files

When dealing with very large CSV files it’s more convenient to have the header in a separate file. This makes it easier to edit the header as you avoid having to open a huge data file just to change it.

[Tip]Tip

import-tool can also process single file compressed archives. e.g. --nodes nodes.csv.gz or --relationships rels.zip

We’ll use the same data as in the previous example but put the headers in separate files.

movies3-header.csv 

movieId:ID,title,year:int,:LABEL

movies3.csv 

tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel

actors3-header.csv 

personId:ID,name,:LABEL

actors3.csv 

keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor

roles3-header.csv 

:START_ID,role,:END_ID,:TYPE

roles3.csv 

keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN

Note how the file groups are enclosed in quotation marks in the command:

neo4j-import --into path_to_target_directory --nodes "movies3-header.csv,movies3.csv" --nodes "actors3-header.csv,actors3.csv" --relationships "roles3-header.csv,roles3.csv"

Multiple input files

As well as using a separate header file you can also provide multiple nodes or relationships files. This may be useful when processing the output from a Hadoop pipeline for example. Files within such an input group can be specified with multiple match strings, delimited by ,, where each match string can be either: the exact file name or a regular expression matching one or more files. Multiple matching files will be sorted according to their characters and their natural number sort order for file names containing numbers.

movies4-header.csv 

movieId:ID,title,year:int,:LABEL

movies4-part1.csv 

tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel

movies4-part2.csv 

tt0242653,"The Matrix Revolutions",2003,Movie;Sequel

actors4-header.csv 

personId:ID,name,:LABEL

actors4-part1.csv 

keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor

actors4-part2.csv 

carrieanne,"Carrie-Anne Moss",Actor

roles4-header.csv 

:START_ID,role,:END_ID,:TYPE

roles4-part1.csv 

keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN

roles4-part2.csv 

laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN

The call to neo4j-import would look like this:

neo4j-import --into path_to_target_directory --nodes "movies4-header.csv,movies4-part1.csv,movies4-part2.csv" --nodes "actors4-header.csv,actors4-part1.csv,actors4-part2.csv" --relationships "roles4-header.csv,roles4-part1.csv,roles4-part2.csv"

Types and labels

Using the same label for every node

If you want to use the same node label(s) for every node in your nodes file you can do this by specifying the appropriate value as an option to neo4j-import. In this example we’ll put the label Movie on every node specified in movies5.csv:

movies5.csv 

movieId:ID,title,year:int
tt0133093,"The Matrix",1999

[Tip]Tip

There’s then no need to specify the :LABEL field in the node file if you pass it as a command line option. If you do then both the label provided in the file and the one provided on the command line will be added to the node.

In this case, we’ll put the labels Movie and Sequel on the nodes specified in sequels5.csv.

sequels5.csv 

movieId:ID,title,year:int
tt0234215,"The Matrix Reloaded",2003
tt0242653,"The Matrix Revolutions",2003

actors5.csv 

personId:ID,name
keanu,"Keanu Reeves"
laurence,"Laurence Fishburne"
carrieanne,"Carrie-Anne Moss"

roles5.csv 

:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN

The call to neo4j-import would look like this:

neo4j-import --into path_to_target_directory --nodes:Movie movies5.csv --nodes:Movie:Sequel sequels5.csv --nodes:Actor actors5.csv --relationships roles5.csv

Using the same relationship type for every relationship

If you want to use the same relationship type for every relationship in your relationships file you can do this by specifying the appropriate value as an option to neo4j-import. In this example we’ll put the relationship type ACTS_IN on every relationship specified in roles6.csv:

movies6.csv 

movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel

actors6.csv 

personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor

roles6.csv 

:START_ID,role,:END_ID
keanu,"Neo",tt0133093
keanu,"Neo",tt0234215
keanu,"Neo",tt0242653
laurence,"Morpheus",tt0133093
laurence,"Morpheus",tt0234215
laurence,"Morpheus",tt0242653
carrieanne,"Trinity",tt0133093
carrieanne,"Trinity",tt0234215
carrieanne,"Trinity",tt0242653

[Tip]Tip

If you provide a relationship type on the command line and in the relationships file the one in the file will be applied.

The call to neo4j-import would look like this:

neo4j-import --into path_to_target_directory --nodes movies6.csv --nodes actors6.csv --relationships:ACTED_IN roles6.csv

Property types

The type for properties specified in nodes and relationships files is defined in the header row. (see Section 29.1, “CSV file header format”)

The following example creates a small graph containing one actor and one movie connected by an ACTED_IN relationship. There is a roles property on the relationship which contains an array of the characters played by the actor in a movie.

movies7.csv 

movieId:ID,title,year:int,:LABEL
tt0099892,"Joe Versus the Volcano",1990,Movie

actors7.csv 

personId:ID,name,:LABEL
meg,"Meg Ryan",Actor

roles7.csv 

:START_ID,roles:string[],:END_ID,:TYPE
meg,"DeDe;Angelica Graynamore;Patricia Graynamore",tt0099892,ACTED_IN

The arguments to neo4j-import would be the following:

neo4j-import --into path_to_target_directory --nodes movies7.csv --nodes actors7.csv --relationships roles7.csv

ID handling

Each node processed by neo4j-import must provide a unique id. We use this id to find the correct nodes when creating relationships.

Working with sequential or auto incrementing identifiers

The import tool makes the assumption that identifiers are unique across node files. This may not be the case for data sets which use sequential, auto incremented or otherwise colliding identifiers. Those data sets can define id spaces where identifiers are unique within their respective id space.

For example if movies and people both use sequential identifiers then we would define Movie and Actor id spaces.

movies8.csv 

movieId:ID(Movie),title,year:int,:LABEL
1,"The Matrix",1999,Movie
2,"The Matrix Reloaded",2003,Movie;Sequel
3,"The Matrix Revolutions",2003,Movie;Sequel

actors8.csv 

personId:ID(Actor),name,:LABEL
1,"Keanu Reeves",Actor
2,"Laurence Fishburne",Actor
3,"Carrie-Anne Moss",Actor

We also need to reference the appropriate id space in our relationships file so it knows which nodes to connect together:

roles8.csv 

:START_ID(Actor),role,:END_ID(Movie)
1,"Neo",1
1,"Neo",2
1,"Neo",3
2,"Morpheus",1
2,"Morpheus",2
2,"Morpheus",3
3,"Trinity",1
3,"Trinity",2
3,"Trinity",3

The command line arguments would remain the same as before:

neo4j-import --into path_to_target_directory --nodes movies8.csv --nodes actors8.csv --relationships:ACTED_IN roles8.csv

Bad input data

The import tool has a threshold of how many bad entities (nodes/relationships) to tolerate and skip before failing the import. By default 1000 bad entities are tolerated. A bad tolerance of 0 will as an example fail the import on the first bad entity. For more information, see the --bad-tolerance option.

There are different types of bad input, which we will look into.

Relationships referring to missing nodes

Relationships that refer to missing node ids, either for :START_ID or :END_ID are considered bad relationships. Whether or not such relationships are skipped is controlled with --skip-bad-relationships flag which can have the values true or false or no value, which means true. Specifying false means that any bad relationship is considered an error and will fail the import. For more information, see the --skip-bad-relationships option.

In the following example there is a missing emil node referenced in the roles file.

movies9.csv 

movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel

actors9.csv 

personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor

roles9.csv 

:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN
emil,"Emil",tt0133093,ACTED_IN

The command line arguments would remain the same as before:

neo4j-import --into path_to_target_directory --nodes movies9.csv --nodes actors9.csv --relationships roles9.csv

Since there was only one bad relationship the import process will complete successfully and a not-imported.bad file will be created and populated with the bad relationships.

not-imported.bad 

InputRelationship:
   source: roles9.csv:11
   properties: [role, Emil]
   startNode: emil
   endNode: tt0133093
   type: ACTED_IN
 refering to missing node emil

Multiple nodes with same id within same id space

Nodes that specify :ID which has already been specified within the id space are considered bad nodes. Whether or not such nodes are skipped is controlled with --skip-duplicate-nodes flag which can have the values true or false or no value, which means true. Specifying false means that any duplicate node is considered an error and will fail the import. For more information, see the --skip-duplicate-nodes option.

In the following example there is a node id that is specified twice within the same id space.

actors10.csv 

personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
laurence,"Laurence Harvey",Actor

neo4j-import --into path_to_target_directory --nodes actors10.csv --skip-duplicate-nodes

Since there was only one bad node the import process will complete successfully and a not-imported.bad file will be created and populated with the bad node.

not-imported.bad 

Id 'laurence' is defined more than once in global id space, at least at actors10.csv:3 and actors10.csv:5