Neo4j: Optimizing Full-Text Index Construction at the Hundred-Million Scale
If Neo4j-based full-text retrieval is the main entry point to the graph, optimizing the graph's search engine is critical.
I. Data volume (hundreds of millions)
count(relationships):500584016
count(nodes):765485810
II. How the index is built
A background script on the server runs the full-text index build:
index.sh
#!/usr/bin/env bash
nohup /neo4j-community-3.4.9/bin/neo4j-shell -file build.cql >>indexGraph.log 2>&1 &

build.cql
CALL zdr.index.addChineseFulltextIndex('IKAnalyzer', ['description','fullname','name','lnkurl'], 'LinkedinID')
YIELD message
RETURN message;
III. Exception raised during the index build
ERROR (-v for expanded information):
    TransactionFailureException: The database has encountered a critical error, and needs to be restarted. Please see database logs for more details.

 -host      Domain name or IP of host to connect to (default: localhost)
 -port      Port of host to connect to (default: 1337)
 -name      RMI name, i.e. rmi://<host>:<port>/<name> (default: shell)
 -pid       Process ID to connect to
 -c         Command line to execute. After executing it the shell exits
 -file      File containing commands to execute, or '-' to read from stdin. After executing it the shell exits
 -readonly  Connect in readonly mode (only for connecting with -path)
 -path      Points to a neo4j db path so that a local server can be started there
 -config    Points to a config file when starting a local server

Example arguments for remote:
    -port 1337
    -host 192.168.1.234 -port 1337 -name shell
    -host localhost -readonly
    ...or no arguments for default values
Example arguments for local:
    -path /path/to/db
    -path /path/to/db -config /path/to/neo4j.config
    -path /path/to/db -readonly

Caused by: java.lang.OutOfMemoryError: Java heap space | GB+Tree[file:/u02/isi/zdr/graph/neo4j-community-3.4.9/data/databases/graph.db/schema/index/lucene_native-2.0/134/string-1.0/index-134, layout:StringLayout[version:0.1, identifier:24016946018123776], generation:16587/16588]
    at org.neo4j.io.pagecache.impl.muninn.CursorFactory.takeWriteCursor(CursorFactory.java:62)
    at org.neo4j.io.pagecache.impl.muninn.MuninnPagedFile.io(MuninnPagedFile.java:186)
    at org.neo4j.index.internal.gbptree.FreeListIdProvider.releaseId(FreeListIdProvider.java:217)
    at org.neo4j.index.internal.gbptree.InternalTreeLogic.createSuccessorIfNeeded(InternalTreeLogic.java:1289)
    at org.neo4j.index.internal.gbptree.InternalTreeLogic.insertInLeaf(InternalTreeLogic.java:513)
    at org.neo4j.index.internal.gbptree.InternalTreeLogic.insert(InternalTreeLogic.java:356)
    at org.neo4j.index.internal.gbptree.GBPTree$SingleWriter.merge(GBPTree.java:1234)
    at org.neo4j.kernel.impl.index.schema.NativeSchemaIndexUpdater.processAdd(NativeSchemaIndexUpdater.java:132)
    at org.neo4j.kernel.impl.index.schema.NativeSchemaIndexUpdater.processUpdate(NativeSchemaIndexUpdater.java:86)
    at org.neo4j.kernel.impl.index.schema.NativeSchemaIndexUpdater.process(NativeSchemaIndexUpdater.java:61)
    at org.neo4j.kernel.impl.index.schema.fusion.FusionIndexUpdater.process(FusionIndexUpdater.java:41)
    at org.neo4j.kernel.impl.api.index.updater.DelegatingIndexUpdater.process(DelegatingIndexUpdater.java:40)
    at org.neo4j.kernel.impl.api.index.IndexingService.processUpdate(IndexingService.java:516)
    at org.neo4j.kernel.impl.api.index.IndexingService.apply(IndexingService.java:479)
    at org.neo4j.kernel.impl.api.index.IndexingService.apply(IndexingService.java:463)
    at org.neo4j.kernel.impl.transaction.command.IndexUpdatesWork.apply(IndexUpdatesWork.java:63)
    at org.neo4j.kernel.impl.transaction.command.IndexUpdatesWork.apply(IndexUpdatesWork.java:42)
    at org.neo4j.concurrent.WorkSync.doSynchronizedWork(WorkSync.java:231)
    at org.neo4j.concurrent.WorkSync.tryDoWork(WorkSync.java:157)
    at org.neo4j.concurrent.WorkSync.apply(WorkSync.java:91)
Java code implementing the index build
/**
 * @param
 * @return
 * @Description: TODO(Build the index and return a message; automatic updates are not supported)
 */
private String chineseFulltextIndex(String indexName, String labelName, List<String> propKeys) {

    Label label = Label.label(labelName);

    // Find all nodes with this label
    ResourceIterator<Node> nodes = db.findNodes(label);
    System.out.println("nodes:" + nodes.toString());

    int nodesSize = 0;
    int propertiesSize = 0;

    // Loop problem: the program began to stall at around 30 million updates
    while (nodes.hasNext()) {
        nodesSize++;
        Node node = nodes.next();
        System.out.println("current nodes:" + node.toString());

        // Properties on this node that need to be indexed
        Set<Map.Entry<String, Object>> properties = node.getProperties(propKeys.toArray(new String[0])).entrySet();
        System.out.println("current node properties" + properties);

        // If the index already exists, remove this node's existing entries from it
        if (db.index().existsForNodes(indexName)) {
            Index<Node> oldIndex = db.index().forNodes(indexName);
            System.out.println("current node index" + oldIndex);
            oldIndex.remove(node);
        }

        // Add a full-text index entry for each property of the node that needs to be indexed
        Index<Node> nodeIndex = db.index().forNodes(indexName, FULL_INDEX_CONFIG);
        for (Map.Entry<String, Object> property : properties) {
            propertiesSize++;
            nodeIndex.add(node, property.getKey(), property.getValue());
        }
        // Elapsed-time calculation
    }

    String message = "IndexName:" + indexName + ",LabelName:" + labelName + ",NodesSize:" + nodesSize + ",PropertiesSize:" + propertiesSize;
    return message;
}
IV. Optimizing the full-text index code
1. java.lang.OutOfMemoryError
java.lang.OutOfMemoryError is a subclass of java.lang.VirtualMachineError, which is thrown when the Java virtual machine is broken or has run out of the resources it needs to keep operating.
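The class relationship described above can be checked with the JDK alone. The following is a minimal sketch; the class name OomHierarchyCheck and its method names are made up for illustration. It also shows that OutOfMemoryError is not an Exception, so a catch (Exception e) block around the indexing loop would not intercept it.

```java
// Illustrative check of the error hierarchy; uses only java.lang classes.
final class OomHierarchyCheck {

    /** Returns true if OutOfMemoryError is a subclass of VirtualMachineError. */
    static boolean oomIsVirtualMachineError() {
        return VirtualMachineError.class.isAssignableFrom(OutOfMemoryError.class);
    }

    /** Returns true if a catch (Exception e) block would also catch OutOfMemoryError. */
    static boolean oomIsException() {
        return Exception.class.isAssignableFrom(OutOfMemoryError.class);
    }

    public static void main(String[] args) {
        System.out.println("subclass of VirtualMachineError: " + oomIsVirtualMachineError());
        System.out.println("caught by catch (Exception e): " + oomIsException());
    }
}
```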
2. Database access and transactions
A program acquires locks and memory when it accesses the database, and neither is released until the transaction completes. That makes the cause of the error above easy to understand. In the indexing program shown earlier, the index is built inside the while loop after the nodes are fetched, and the transaction is not closed (so its memory cannot be reclaimed) until the whole index has been built. When the amount of data is huge, the heap overflows.
3. Optimization approach
Commit the transaction in batches instead of once at the end.
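Why batching bounds memory can be illustrated with a toy model. This sketch does not use the Neo4j API: ToyTransaction is a hypothetical stand-in that simply holds every pending change until commit, and peakPending reports the largest number of changes held at once. It assumes, as described above, that transaction state is only released on commit.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: mimics a transaction that holds every pending change until commit. */
final class BatchCommitModel {

    /** Hypothetical stand-in for a transaction; this is not the Neo4j API. */
    static final class ToyTransaction {
        private final List<Long> pending = new ArrayList<>();

        void add(long nodeId) { pending.add(nodeId); }

        int pendingCount() { return pending.size(); }

        // Pending state is released only when the transaction commits.
        void commit() { pending.clear(); }
    }

    /**
     * Processes totalNodes items, committing every batchSize items, and
     * returns the largest number of changes held at any one time.
     */
    static int peakPending(int totalNodes, int batchSize) {
        ToyTransaction tx = new ToyTransaction();
        int peak = 0;
        for (long id = 0; id < totalNodes; id++) {
            tx.add(id);
            peak = Math.max(peak, tx.pendingCount());
            if (tx.pendingCount() == batchSize) {
                tx.commit();
                tx = new ToyTransaction();
            }
        }
        tx.commit(); // commit the final, possibly partial, batch
        return peak;
    }

    public static void main(String[] args) {
        int total = 1_000_000;
        // A batch size larger than the input behaves like one big transaction.
        System.out.println("single transaction peak: " + peakPending(total, Integer.MAX_VALUE));
        System.out.println("batched (50,000) peak: " + peakPending(total, 50_000));
    }
}
```

In this model the peak grows with the data size when there is a single transaction, and stays at the batch size when commits are batched. Real heap usage also depends on what each change holds, which the model ignores.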
4. Optimized code
/**
 * @param
 * @return
 * @Description: TODO(Build the index and return a message; automatic updates are not supported)
 */
private String chineseFulltextIndex(String indexName, String labelName, List<String> propKeys) {

    Label label = Label.label(labelName);

    int nodesSize = 0;
    int propertiesSize = 0;

    // Find all nodes with this label
    ResourceIterator<Node> nodes = db.findNodes(label);

    Transaction tx = db.beginTx();
    try {
        int batch = 0;
        long startTime = System.nanoTime();

        while (nodes.hasNext()) {
            nodesSize++;
            Node node = nodes.next();

            boolean indexed = false;

            // Properties on this node that need to be indexed
            Set<Map.Entry<String, Object>> properties = node.getProperties(propKeys.toArray(new String[0])).entrySet();

            // If the index already exists, remove this node's existing entries from it
            if (db.index().existsForNodes(indexName)) {
                Index<Node> oldIndex = db.index().forNodes(indexName);
                oldIndex.remove(node);
            }

            // Add a full-text index entry for each property of the node that needs to be indexed
            Index<Node> nodeIndex = db.index().forNodes(indexName, FULL_INDEX_CONFIG);
            for (Map.Entry<String, Object> property : properties) {
                indexed = true;
                propertiesSize++;
                nodeIndex.add(node, property.getKey(), property.getValue());
            }

            // Commit the transaction after every 50,000 indexed nodes
            if (indexed) {
                if (++batch == 50_000) {
                    batch = 0;
                    tx.success();
                    tx.close();
                    tx = db.beginTx();

                    // Log the time taken by this batch
                    startTime = indexConsumeTime(startTime, nodesSize, propertiesSize);
                }
            }
        }
        tx.success();
        // Log the time taken by the final batch
        indexConsumeTime(startTime, nodesSize, propertiesSize);
    } finally {
        tx.close();
    }

    String message = "IndexName:" + indexName + ",LabelName:" + labelName + ",NodesSize:" + nodesSize + ",PropertiesSize:" + propertiesSize;
    return message;
}
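The indexConsumeTime helper called above is not shown in the source. One plausible shape, inferred only from how it is called (it takes the batch start time and the two counters, and returns a new start time) and from the format of the log lines in the performance test, is sketched below. The class name IndexTiming and the formatProgress method are assumptions made for illustration; the author's actual helper may differ.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical timing helper; a guess at what indexConsumeTime might look like. */
final class IndexTiming {

    /** Formats one progress line in the same shape as the build log output. */
    static String formatProgress(int nodesSize, int propertiesSize, long elapsedMillis) {
        return "Build index-nodeSize:" + nodesSize
                + ",propertieSize:" + propertiesSize
                + ",consume:" + elapsedMillis + "ms";
    }

    /**
     * Prints the time elapsed since startTime (a System.nanoTime() value)
     * and returns a fresh start time for the next batch.
     */
    static long indexConsumeTime(long startTime, int nodesSize, int propertiesSize) {
        long now = System.nanoTime();
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(now - startTime);
        System.out.println(formatProgress(nodesSize, propertiesSize, elapsedMillis));
        return now;
    }
}
```

Returning the new start time lets the caller write startTime = indexConsumeTime(startTime, ...), so each log line covers exactly one batch rather than the whole run.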
5. Performance test
The transaction is committed in batches of 50,000 nodes. In the log below, nodeSize and propertieSize are cumulative totals, and consume is the time taken by each batch.
The first batches are slow (about 13 s to 21 s each for the first 400,000 nodes); after that, the time per batch stabilizes at roughly 2 s to 5 s per 50,000 nodes. For 1 billion nodes (20,000 batches), that extrapolates to between about 11 h and 28 h.
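The extrapolation is plain arithmetic: number of batches times seconds per batch, converted to hours. The sketch below works it through under the assumption that the per-batch time stays constant for the whole build; the class and method names are illustrative only.

```java
/** Back-of-the-envelope estimate, assuming a constant time per batch. */
final class BuildTimeEstimate {

    /** Estimated total build time in hours. */
    static double estimateHours(long totalNodes, int batchSize, double secondsPerBatch) {
        long batches = (totalNodes + batchSize - 1) / batchSize; // round up to whole batches
        return batches * secondsPerBatch / 3600.0;
    }

    public static void main(String[] args) {
        long totalNodes = 1_000_000_000L; // 1 billion nodes, i.e. 20,000 batches of 50,000
        System.out.printf("at 2 s per batch: %.1f h%n", estimateHours(totalNodes, 50_000, 2.0));
        System.out.printf("at 5 s per batch: %.1f h%n", estimateHours(totalNodes, 50_000, 5.0));
    }
}
```

With these inputs the lower bound is about 11.1 h and the upper bound about 27.8 h.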
Build index-nodeSize:50000,propertieSize:148777,consume:21434ms
Build index-nodeSize:100000,propertieSize:297883,consume:18493ms
Build index-nodeSize:150000,propertieSize:446936,consume:17140ms
Build index-nodeSize:200000,propertieSize:595981,consume:17323ms
Build index-nodeSize:250000,propertieSize:745039,consume:19680ms
Build index-nodeSize:300000,propertieSize:894026,consume:18451ms
Build index-nodeSize:350000,propertieSize:1042994,consume:20266ms
Build index-nodeSize:400000,propertieSize:1160186,consume:12787ms
Build index-nodeSize:450000,propertieSize:1210186,consume:1946ms
Build index-nodeSize:500000,propertieSize:1260186,consume:3174ms
Build index-nodeSize:550000,propertieSize:1310186,consume:3090ms
Build index-nodeSize:600000,propertieSize:1360186,consume:3063ms
Build index-nodeSize:650000,propertieSize:1410186,consume:1868ms
Build index-nodeSize:700000,propertieSize:1460186,consume:2036ms
Build index-nodeSize:750000,propertieSize:1510186,consume:3784ms
Build index-nodeSize:800000,propertieSize:1560186,consume:3037ms
Build index-nodeSize:850000,propertieSize:1610186,consume:2627ms
Build index-nodeSize:900000,propertieSize:1660186,consume:1900ms
Build index-nodeSize:950000,propertieSize:1710186,consume:2944ms
Build index-nodeSize:1000000,propertieSize:1760186,consume:3369ms
Build index-nodeSize:1050000,propertieSize:1810186,consume:3289ms
Build index-nodeSize:1100000,propertieSize:1860186,consume:2763ms
Build index-nodeSize:1150000,propertieSize:1910186,consume:3237ms
Build index-nodeSize:1200000,propertieSize:1960186,consume:3408ms
Build index-nodeSize:1250000,propertieSize:2010186,consume:3644ms
Build index-nodeSize:1300000,propertieSize:2060186,consume:3661ms
Build index-nodeSize:1350000,propertieSize:2110186,consume:2964ms
Build index-nodeSize:1400000,propertieSize:2160186,consume:3219ms
Build index-nodeSize:1450000,propertieSize:2210186,consume:3356ms
Build index-nodeSize:1500000,propertieSize:2260186,consume:4115ms
Build index-nodeSize:1550000,propertieSize:2310186,consume:3188ms
Build index-nodeSize:1600000,propertieSize:2360186,consume:3364ms
Build index-nodeSize:1650000,propertieSize:2410186,consume:3799ms
Build index-nodeSize:1700000,propertieSize:2460186,consume:4301ms
Build index-nodeSize:1750000,propertieSize:2510186,consume:3772ms
Build index-nodeSize:1800000,propertieSize:2560186,consume:3692ms
Build index-nodeSize:1850000,propertieSize:2610186,consume:3428ms
Build index-nodeSize:1900000,propertieSize:2660186,consume:2930ms
Note: two hours into the index build on this test dataset, 14.95 million nodes had been indexed. By that point the build had slowed down significantly, with individual batches taking several minutes, so further optimization is needed.
Build index-nodeSize:13850000,propertieSize:14610186,consume:97290ms
Build index-nodeSize:13900000,propertieSize:14660186,consume:7441ms
Build index-nodeSize:13950000,propertieSize:14710186,consume:3730ms
Build index-nodeSize:14000000,propertieSize:14760186,consume:3512ms
Build index-nodeSize:14050000,propertieSize:14810186,consume:4545ms
Build index-nodeSize:14100000,propertieSize:14860186,consume:12100ms
Build index-nodeSize:14150000,propertieSize:14910186,consume:83071ms
Build index-nodeSize:14200000,propertieSize:14960186,consume:7417ms
Build index-nodeSize:14250000,propertieSize:15010186,consume:3579ms
Build index-nodeSize:14300000,propertieSize:15060186,consume:64841ms
Build index-nodeSize:14350000,propertieSize:15110186,consume:7553ms
Build index-nodeSize:14400000,propertieSize:15160186,consume:63141ms
Build index-nodeSize:14450000,propertieSize:15210186,consume:64316ms
Build index-nodeSize:14500000,propertieSize:15260186,consume:187510ms
Build index-nodeSize:14550000,propertieSize:15310186,consume:247571ms
Build index-nodeSize:14600000,propertieSize:15360186,consume:224611ms
Build index-nodeSize:14650000,propertieSize:15410186,consume:244539ms
Build index-nodeSize:14700000,propertieSize:15460186,consume:354684ms
Build index-nodeSize:14750000,propertieSize:15510186,consume:236970ms
Build index-nodeSize:14800000,propertieSize:15560186,consume:308532ms
Build index-nodeSize:14850000,propertieSize:15610186,consume:429815ms
Build index-nodeSize:14900000,propertieSize:15660186,consume:409451ms
Build index-nodeSize:14950000,propertieSize:15710186,consume:456980ms
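To locate where the slowdown begins in a long log like this, the batch lines can be scanned for consume values above a threshold. The following is a small sketch under that assumption; BuildLogScan, slowBatches, and the 60-second threshold are all illustrative choices, not part of the original tooling. The sample lines are copied from the log above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative log scanner for the "Build index-nodeSize:..." progress lines. */
final class BuildLogScan {

    private static final Pattern LINE =
            Pattern.compile("nodeSize:(\\d+),propertieSize:(\\d+),consume:(\\d+)ms");

    /** Returns the cumulative nodeSize of every batch whose consume time exceeds thresholdMillis. */
    static List<Long> slowBatches(String log, long thresholdMillis) {
        List<Long> slow = new ArrayList<>();
        Matcher m = LINE.matcher(log);
        while (m.find()) {
            long nodeSize = Long.parseLong(m.group(1));
            long consume = Long.parseLong(m.group(3));
            if (consume > thresholdMillis) {
                slow.add(nodeSize);
            }
        }
        return slow;
    }

    public static void main(String[] args) {
        String sample =
                "Build index-nodeSize:14250000,propertieSize:15010186,consume:3579ms\n"
              + "Build index-nodeSize:14300000,propertieSize:15060186,consume:64841ms\n"
              + "Build index-nodeSize:14350000,propertieSize:15110186,consume:7553ms\n";
        // Flags batches that took longer than 60 seconds.
        System.out.println(slowBatches(sample, 60_000));
    }
}
```

Because the matcher uses find() rather than matching whole lines, the scan also works if the log entries are separated by spaces instead of line breaks.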