This tutorial will show you how to import data from CSV files using LOAD CSV
.
In this example, we’re given three CSV files: a list of persons, a list of movies, and a list of which role was played by some of these persons in each movie.
CSV files can be stored on the database server and are then accessible using a file://
URL.
Alternatively, LOAD CSV
also supports accessing CSV files via HTTPS
, HTTP
, and FTP
.
LOAD CSV
will follow HTTP
redirects but for security reasons it will not follow redirects that changes the protocol, for example if the redirect is going from HTTPS
to HTTP
.
For more details, see Section 11.6, “Load CSV”.
Using the following Cypher queries, we’ll create a node for each person, a node for each movie and a relationship between the two with a property denoting the role. We’re also keeping track of the country in which each movie was made.
Let’s start with importing the persons:
LOAD CSV WITH HEADERS FROM "http://neo4j.com/docs/3.1.0-SNAPSHOT/csv/import/persons.csv" AS csvLine CREATE (p:Person { id: toInt(csvLine.id), name: csvLine.name })
The CSV file we’re using looks like this:
persons.csv
id,name 1,Charlie Sheen 2,Oliver Stone 3,Michael Douglas 4,Martin Sheen 5,Morgan Freeman
Now, let’s import the movies. This time, we’re also creating a relationship to the country in which the movie was made. If you are storing your data in a SQL database, this is the one-to-many relationship type.
We’re using MERGE
to create nodes that represent countries.
Using MERGE
avoids creating duplicate country nodes in the case where multiple movies have been made in the same country.
Important When using |
Before running our query to connect movies and countries we’ll create an index for the name property on the Country label to ensure the query runs as fast as it can:
CREATE INDEX ON :Country(name)
LOAD CSV WITH HEADERS FROM "http://neo4j.com/docs/3.1.0-SNAPSHOT/csv/import/movies.csv" AS csvLine MERGE (country:Country { name: csvLine.country }) CREATE (movie:Movie { id: toInt(csvLine.id), title: csvLine.title, year:toInt(csvLine.year)}) CREATE (movie)-[:MADE_IN]->(country)
movies.csv
id,title,country,year 1,Wall Street,USA,1987 2,The American President,USA,1995 3,The Shawshank Redemption,USA,1994
Lastly, we create the relationships between the persons and the movies. Since the relationship is a many to many relationship, one actor can participate in many movies, and one movie has many actors in it. We have this data in a separate file.
We’ll index the id property on Person and Movie nodes.
The id property is a temporary property used to look up the appropriate nodes for a relationship when importing the third file.
By indexing the id property, node lookup (e.g. by MATCH
) will be much faster.
Since we expect the ids to be unique in each set, we’ll create a unique constraint.
This protects us from invalid data since constraint creation will fail if there are multiple nodes with the same id property.
Creating a unique constraint also creates a unique index (which is faster than a regular index).
CREATE CONSTRAINT ON (person:Person) ASSERT person.id IS UNIQUE
CREATE CONSTRAINT ON (movie:Movie) ASSERT movie.id IS UNIQUE
Now importing the relationships is a matter of finding the nodes and then creating relationships between them.
For this query we’ll use USING PERIODIC COMMIT
(see Section 12.9, “Using Periodic Commit”) which is helpful for queries that operate on large CSV files.
This hint tells Neo4j that the query might build up inordinate amounts of transaction state, and so needs to be periodically committed.
In this case we also set the limit to 500
rows per commit.
USING PERIODIC COMMIT 500 LOAD CSV WITH HEADERS FROM "http://neo4j.com/docs/3.1.0-SNAPSHOT/csv/import/roles.csv" AS csvLine MATCH (person:Person { id: toInt(csvLine.personId)}),(movie:Movie { id: toInt(csvLine.movieId)}) CREATE (person)-[:PLAYED { role: csvLine.role }]->(movie)
roles.csv
personId,movieId,role 1,1,Bud Fox 4,1,Carl Fox 3,1,Gordon Gekko 4,2,A.J. MacInerney 3,2,President Andrew Shepherd 5,3,Ellis Boyd 'Red' Redding
Finally, as the id
property was only necessary to import the relationships, we can drop the constraints and the id property from all movie and person nodes.
DROP CONSTRAINT ON (person:Person) ASSERT person.id IS UNIQUE
DROP CONSTRAINT ON (movie:Movie) ASSERT movie.id IS UNIQUE
MATCH (n) WHERE n:Person OR n:Movie REMOVE n.id