Linux
Under Unix/Linux/OSX, the command is named neo4j-import. Depending on the installation type, the tool is either available globally, or used by executing ./bin/neo4j-import from inside the installation directory.
Windows
Under Windows, the tool is used by executing bin\neo4j-import from inside the installation directory.
For help with running the import tool under Windows, see the reference in Windows.
Options
- --into <store-dir>
- Database directory to import into. Must not contain an existing database.
- --nodes[:Label1:Label2] "<file1>,<file2>,…"
- Node CSV header and data. Multiple files will be logically seen as one big file from the perspective of the importer. The first line must contain the header. Multiple data sources like these can be specified in one import, where each data source has its own header. Note that file groups must be enclosed in quotation marks.
- --relationships[:RELATIONSHIP_TYPE] "<file1>,<file2>,…"
- Relationship CSV header and data. Multiple files will be logically seen as one big file from the perspective of the importer. The first line must contain the header. Multiple data sources like these can be specified in one import, where each data source has its own header. Note that file groups must be enclosed in quotation marks.
- --delimiter <delimiter-character>
- Delimiter character, or TAB, between values in CSV data. The default option is , (comma).
- --array-delimiter <array-delimiter-character>
- Delimiter character, or TAB, between array elements within a value in CSV data. The default option is ; (semicolon).
- --quote <quotation-character>
- Character to treat as quotation character for values in CSV data. The default option is ". Quotes inside quotes escaped like """Go away"", he said." and "\"Go away\", he said." are supported. If you have set ' to be used as the quotation character, you could write the previous example like this instead: '"Go away", he said.'
- --multiline-fields <true/false>
- Whether or not fields from input source can span multiple lines, i.e. contain newline characters. Default value: false
- --trim-strings <true/false>
- Whether or not leading and trailing whitespace should be trimmed from strings. Default value: false
- --input-encoding <character set>
- Character set that input data is encoded in. The provided value must be one of the character sets available in the JVM, as provided by Charset#availableCharsets(). If no input encoding is provided, the default character set of the JVM will be used.
- --ignore-empty-strings <true/false>
- Whether or not empty string fields ("") from the input source are ignored, i.e. treated as null. Default value: false
- --id-type <id-type>
- One of [STRING, INTEGER, ACTUAL]; specifies how ids in node/relationship input files are treated.
STRING: arbitrary strings for identifying nodes.
INTEGER: arbitrary integer values for identifying nodes.
ACTUAL: (advanced) actual node ids.
Default value: STRING
- --processors <max processor count>
- (advanced) Max number of processors used by the importer. Defaults to the number of available processors reported by the JVM. The importer always needs a certain minimum number of threads, so there is no lower bound for this value. For optimal performance this value shouldn’t be greater than the number of available processors.
- --stacktrace <true/false>
- Enable printing of error stack traces.
- --bad-tolerance <max number of bad entries>
- Number of bad entries before the import is considered failed. This tolerance threshold applies to relationships referring to missing nodes. Format errors in input data are still treated as errors. Default value: 1000
- --skip-bad-relationships <true/false>
- Whether or not to skip importing relationships that refer to missing node ids, i.e. where either the start or end node id/group refers to a node that wasn’t specified by the node input data. Skipped relationships will be logged, up to the number of entities specified by bad-tolerance. Default value: true
- --skip-duplicate-nodes <true/false>
- Whether or not to skip importing nodes that have the same id/group. In the event of multiple nodes within the same group having the same id, the first encountered will be imported whereas subsequent such nodes will be skipped. Skipped nodes will be logged, up to the number of entities specified by bad-tolerance. Default value: false
- --ignore-extra-columns <true/false>
- Whether or not to ignore extra columns in the data not specified by the header. Skipped columns will be logged, up to the number of entities specified by bad-tolerance. Default value: false
- --db-config <path/to/neo4j.properties>
- (advanced) File specifying database-specific configuration. For more information, consult the manual about the configuration options available for a neo4j configuration file. Only configuration affecting the store at time of creation will be read. Examples of supported config are: dbms.relationship_grouping_threshold, unsupported.dbms.block_size.strings, and unsupported.dbms.block_size.array_properties.
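To illustrate how the options above combine on a command line, the following is a minimal Python sketch that assembles an argument list for the tool. The helper name build_import_command, the store directory graph.db, and all CSV file names are hypothetical and only serve the example; the flags themselves are the ones listed above. Passing the arguments as a list keeps each comma-separated file group together as a single argument, which is the effect the quotation marks have in a shell.

```python
import shlex


def build_import_command(store_dir, nodes, relationships,
                         executable="./bin/neo4j-import", **options):
    """Assemble an argument list for the import tool (illustrative helper).

    nodes:         list of (labels, files) pairs, e.g. (["Person"], ["a.csv"])
    relationships: list of (type_or_None, files) pairs
    options:       extra flags, written with underscores (id_type="INTEGER")
    """
    argv = [executable, "--into", store_dir]
    for labels, files in nodes:
        # --nodes[:Label1:Label2] "<file1>,<file2>,..."
        flag = "--nodes" + "".join(":" + label for label in labels)
        argv += [flag, ",".join(files)]
    for rel_type, files in relationships:
        # --relationships[:RELATIONSHIP_TYPE] "<file1>,<file2>,..."
        flag = "--relationships" + (":" + rel_type if rel_type else "")
        argv += [flag, ",".join(files)]
    for name, value in options.items():
        # Booleans are rendered as the lowercase true/false the options use.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        argv += ["--" + name.replace("_", "-"), text]
    return argv


# Hypothetical input files; the first file in each group carries the header.
example = build_import_command(
    "graph.db",
    nodes=[(["Person"], ["persons_header.csv", "persons.csv"])],
    relationships=[("KNOWS", ["knows_header.csv", "knows.csv"])],
    id_type="INTEGER",
    skip_duplicate_nodes=True,
)
print(shlex.join(example))
```

The resulting list could be handed to subprocess.run(example) from inside the installation directory; printing it with shlex.join first makes it easy to check the command before running it.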
Output and statistics
While an import is running through its different stages, some statistics and figures are printed in the console.
The general interpretation of that output is to look at the horizontal line, which is divided up into sections, each section representing one type of work going on in parallel with the other sections.
The wider a section is, the more time is spent there relative to the other sections, the widest being the bottleneck, also marked with *.
If a section has a double line, instead of just a single line, it means that multiple threads are executing the work in that section.
To the far right a number is displayed telling how many entities (nodes or relationships) have been processed by that stage.
As an example:
[*>:20,25 MB/s------------------|PREPARE(3)====================|RELATIONSHIP(2)===============] 16M
Would be interpreted as:
- > data being read, and perhaps parsed, at 20,25 MB/s, data that is being passed on to …
- PREPARE preparing the data for …
- RELATIONSHIP creating actual relationship records and …
- v writing the relationships to the store. This step isn’t visible in this example, because it’s so cheap compared to the other sections.
Observing the section sizes can give hints about where performance can be improved.
In the example above, the bottleneck is the data read section (marked with >), which might indicate that the disk is being slow, or is poorly handling simultaneous read and write operations (since the last section often revolves around writing to disk).
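The reading rules described in this section (sections separated by |, the bottleneck marked with *, a double line meaning multiple threads) can be sketched as a small parser. This is an illustration only: the function and field names are hypothetical, and the layout is inferred solely from the single example line above, so real output may contain variations this sketch does not handle.

```python
from typing import List, NamedTuple, Tuple


class Section(NamedTuple):
    label: str            # text of the section without marker and padding
    width: int            # number of characters the section occupies
    bottleneck: bool      # True if the section is marked with *
    multi_threaded: bool  # True if padded with a double line (=)


def parse_progress_line(line: str) -> Tuple[List[Section], str]:
    """Split one progress line into its sections and the processed count."""
    start = line.index("[")
    end = line.rindex("]")
    body = line[start + 1:end]
    processed = line[end + 1:].strip()  # e.g. "16M", to the far right
    sections = []
    for raw in body.split("|"):
        sections.append(Section(
            label=raw.lstrip("*").rstrip("-="),
            width=len(raw),
            bottleneck=raw.startswith("*"),
            multi_threaded=raw.endswith("="),
        ))
    return sections, processed


sample = ("[*>:20,25 MB/s------------------|PREPARE(3)===================="
          "|RELATIONSHIP(2)===============] 16M")
sections, processed = parse_progress_line(sample)
for section in sections:
    print(section)
print("entities processed:", processed)
```

For the sample line, the first section is the one flagged as the bottleneck, and the PREPARE and RELATIONSHIP sections are the ones flagged as multi-threaded, matching the interpretation given above.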
Verbose error information
In some cases, if an unexpected error occurs, it might be useful to supply the command line option --stacktrace to the import (and rerun the import to actually see the additional information).
This will have the error printed with additional debug information, useful for both developers and issue reporting.