For general notes on batch insertion, see Chapter 35, Batch Insertion.
Indexing during batch insertion is done using BatchInserterIndex instances, which are provided by a BatchInserterIndexProvider. An example:
BatchInserter inserter = BatchInserters.inserter( file );
BatchInserterIndexProvider indexProvider = new LuceneBatchInserterIndexProvider( inserter );
BatchInserterIndex actors = indexProvider.nodeIndex( "actors", MapUtil.stringMap( "type", "exact" ) );
actors.setCacheCapacity( "name", 100000 );

Map<String, Object> properties = MapUtil.map( "name", "Keanu Reeves" );
long node = inserter.createNode( properties );
actors.add( node, properties );

// Make the changes visible for reading. Use this sparingly, it requires IO!
actors.flush();

// Make sure to shut down the index provider as well
indexProvider.shutdown();
inserter.shutdown();
The configuration parameters are the same as mentioned in Section 34.10, “Configuration and fulltext indexes”.
Best practices
Here are some pointers to get the most performance out of BatchInserterIndex:
- Avoid flushing too often: each flush makes all additions since the last flush visible to the querying methods, and publishing those changes carries a performance penalty.
- Make phases as large as possible, with each phase consisting of either only writes or only reads, and don’t forget to flush after a write phase so that those changes become visible to the querying methods.
- Enable caching for keys you know you’re going to do lookups for later on to increase performance significantly (though insertion performance may degrade slightly).
Note: Changes to the index become available for reading only after they are flushed to disk. Thus, for optimal performance, read and lookup operations should be kept to a minimum during batch insertion, since they involve IO and impact speed negatively.
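The write-phase, flush, read-phase pattern can be illustrated with a small stand-in class. This is only a sketch: BufferedIndexSketch and its methods are hypothetical names introduced here, not part of the Neo4j API, and the class models nothing beyond the rule that additions stay invisible to lookups until flush() is called.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, NOT the Neo4j BatchInserterIndex API.
// It models one rule only: additions are invisible until flush().
final class BufferedIndexSketch {
    // Entries that lookups can see.
    private final Map<String, List<Long>> visible = new HashMap<>();
    // Entries added since the last flush (parallel lists).
    private final List<String> pendingValues = new ArrayList<>();
    private final List<Long> pendingNodes = new ArrayList<>();
    private int flushCount = 0;

    // Buffer an addition; it is not yet visible to get().
    void add(long nodeId, String value) {
        pendingValues.add(value);
        pendingNodes.add(nodeId);
    }

    // Publish all buffered additions in one step.
    void flush() {
        for (int i = 0; i < pendingValues.size(); i++) {
            visible.computeIfAbsent(pendingValues.get(i), k -> new ArrayList<>())
                   .add(pendingNodes.get(i));
        }
        pendingValues.clear();
        pendingNodes.clear();
        flushCount++;
    }

    // Look up node ids; sees only what has been flushed.
    List<Long> get(String value) {
        return visible.getOrDefault(value, Collections.emptyList());
    }

    int flushCount() {
        return flushCount;
    }

    public static void main(String[] args) {
        BufferedIndexSketch actors = new BufferedIndexSketch();
        String[] names = { "Keanu Reeves", "Carrie-Anne Moss", "Laurence Fishburne" };

        // Write phase: add everything first, without reading in between.
        for (int i = 0; i < names.length; i++) {
            actors.add(i, names[i]);
        }
        System.out.println(actors.get("Keanu Reeves")); // prints [] (not flushed yet)

        // One flush at the end of the write phase.
        actors.flush();

        // Read phase: all additions are now visible.
        System.out.println(actors.get("Keanu Reeves")); // prints [0]
        System.out.println(actors.flushCount());        // prints 1
    }
}
```

In this model, grouping all additions into one write phase means a single flush publishes everything, whereas alternating adds and lookups would need one flush per lookup to see the latest additions.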