NEO4J Full-Text Index Construction Optimization at the 100-Million Scale

Posted by lisa007 on Tue, 07 May 2019 12:10:06 +0200

If NEO4J-based full-text retrieval is the main entry point to the graph, optimizing the graph's search index is critical.

1. Data volume (hundreds of millions)

count(relationships):500584016

count(nodes):765485810

2. Ways to Build Indexes

Run the index-building procedure from a background script on the server:

index.sh
#!/usr/bin/env bash
nohup /neo4j-community-3.4.9/bin/neo4j-shell -file build.cql >>indexGraph.log 2>&1 &
build.cql
CALL zdr.index.addChineseFulltextIndex('IKAnalyzer', ['description','fullname','name','lnkurl'], 'LinkedinID') YIELD message RETURN message;

3. Exception Raised During Index Building

ERROR (-v for expanded information):
	TransactionFailureException: The database has encountered a critical error, and needs to be restarted. Please see database logs for more details.

 -host      Domain name or IP of host to connect to (default: localhost)
 -port      Port of host to connect to (default: 1337)
 -name      RMI name, i.e. rmi://<host>:<port>/<name> (default: shell)
 -pid       Process ID to connect to
 -c         Command line to execute. After executing it the shell exits
 -file      File containing commands to execute, or '-' to read from stdin. After executing it the shell exits
 -readonly  Connect in readonly mode (only for connecting with -path)
 -path      Points to a neo4j db path so that a local server can be started there
 -config    Points to a config file when starting a local server

Example arguments for remote:
	-port 1337
	-host 192.168.1.234 -port 1337 -name shell
	-host localhost -readonly
	...or no arguments for default values
Example arguments for local:
	-path /path/to/db
	-path /path/to/db -config /path/to/neo4j.config
	-path /path/to/db -readonly
Caused by: java.lang.OutOfMemoryError: Java heap space | GB+Tree[file:/u02/isi/zdr/graph/neo4j-community-3.4.9/data/databases/graph.db/schema/index/lucene_native-2.0/134/string-1.0/index-134, layout:StringLayout[version:0.1, identifier:24016946018123776], generation:16587/16588]
        at org.neo4j.io.pagecache.impl.muninn.CursorFactory.takeWriteCursor(CursorFactory.java:62)
        at org.neo4j.io.pagecache.impl.muninn.MuninnPagedFile.io(MuninnPagedFile.java:186)
        at org.neo4j.index.internal.gbptree.FreeListIdProvider.releaseId(FreeListIdProvider.java:217)
        at org.neo4j.index.internal.gbptree.InternalTreeLogic.createSuccessorIfNeeded(InternalTreeLogic.java:1289)
        at org.neo4j.index.internal.gbptree.InternalTreeLogic.insertInLeaf(InternalTreeLogic.java:513)
        at org.neo4j.index.internal.gbptree.InternalTreeLogic.insert(InternalTreeLogic.java:356)
        at org.neo4j.index.internal.gbptree.GBPTree$SingleWriter.merge(GBPTree.java:1234)
        at org.neo4j.kernel.impl.index.schema.NativeSchemaIndexUpdater.processAdd(NativeSchemaIndexUpdater.java:132)
        at org.neo4j.kernel.impl.index.schema.NativeSchemaIndexUpdater.processUpdate(NativeSchemaIndexUpdater.java:86)
        at org.neo4j.kernel.impl.index.schema.NativeSchemaIndexUpdater.process(NativeSchemaIndexUpdater.java:61)
        at org.neo4j.kernel.impl.index.schema.fusion.FusionIndexUpdater.process(FusionIndexUpdater.java:41)
        at org.neo4j.kernel.impl.api.index.updater.DelegatingIndexUpdater.process(DelegatingIndexUpdater.java:40)
        at org.neo4j.kernel.impl.api.index.IndexingService.processUpdate(IndexingService.java:516)
        at org.neo4j.kernel.impl.api.index.IndexingService.apply(IndexingService.java:479)
        at org.neo4j.kernel.impl.api.index.IndexingService.apply(IndexingService.java:463)
        at org.neo4j.kernel.impl.transaction.command.IndexUpdatesWork.apply(IndexUpdatesWork.java:63)
        at org.neo4j.kernel.impl.transaction.command.IndexUpdatesWork.apply(IndexUpdatesWork.java:42)
        at org.neo4j.concurrent.WorkSync.doSynchronizedWork(WorkSync.java:231)
        at org.neo4j.concurrent.WorkSync.tryDoWork(WorkSync.java:157)
        at org.neo4j.concurrent.WorkSync.apply(WorkSync.java:91)

Java code that builds the index

    /**
     * @param
     * @return
     * @Description: Build the index and return a message (automatic index updates are not supported)
     */
    private String chineseFulltextIndex(String indexName, String labelName, List<String> propKeys) {

        Label label = Label.label(labelName);

        // Find all nodes that carry this label
        ResourceIterator<Node> nodes = db.findNodes(label);
        System.out.println("nodes:" + nodes.toString());

        int nodesSize = 0;
        int propertiesSize = 0;

        // Problem with this loop: the program starts to stall once about 30 million nodes have been processed
        while (nodes.hasNext()) {
            nodesSize++;
            Node node = nodes.next();
            System.out.println("current nodes:" + node.toString());

            // Properties on each node that need to be indexed
            Set<Map.Entry<String, Object>> properties = node.getProperties(propKeys.toArray(new String[0])).entrySet();
            System.out.println("current node properties" + properties);

            // If the index already exists, remove this node's old entries from it
            if (db.index().existsForNodes(indexName)) {
                Index<Node> oldIndex = db.index().forNodes(indexName);
                System.out.println("current node index" + oldIndex);
                oldIndex.remove(node);
            }

            // Add a full-text index for each attribute of the node that needs to be indexed
            Index<Node> nodeIndex = db.index().forNodes(indexName, FULL_INDEX_CONFIG);
            for (Map.Entry<String, Object> property : properties) {
                propertiesSize++;
                nodeIndex.add(node, property.getKey(), property.getValue());
            }
            // Elapsed-time calculation
        }

        String message = "IndexName:" + indexName + ",LabelName:" + labelName + ",NodesSize:" + nodesSize + ",PropertiesSize:" + propertiesSize;
        return message;
    }

IV. Code optimization for full-text index

1. java.lang.OutOfMemoryError

java.lang.OutOfMemoryError is a subclass of java.lang.VirtualMachineError. The Java virtual machine throws it when it cannot allocate an object because it is out of memory and the garbage collector cannot make any more memory available.

2. What happens when accessing the database

A program acquires locks and memory when it accesses the database, and these are not released until the transaction completes. That makes the cause of this bug easy to understand. In the indexing program above, the index is built inside the while loop after the nodes are fetched, so the transaction is not closed, and its memory is not reclaimed, until the whole index has been built. When the amount of data is huge, the heap overflows.
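One way to watch this effect while the loop runs is to log heap occupancy at intervals. The sketch below is only an illustration: HeapWatch and usedHeapFraction are hypothetical names that do not appear in the original program, and the code relies only on the standard java.lang.Runtime calls.

```java
class HeapWatch {
    /** Fraction of the maximum heap currently in use, between 0.0 and 1.0. */
    static double usedHeapFraction() {
        Runtime rt = Runtime.getRuntime();
        // totalMemory() is the heap currently reserved; freeMemory() is the unused part of it
        long used = rt.totalMemory() - rt.freeMemory();
        // maxMemory() is the upper limit the JVM will try to use (the -Xmx setting)
        return (double) used / rt.maxMemory();
    }

    public static void main(String[] args) {
        System.out.printf("heap used: %.1f%%%n", usedHeapFraction() * 100);
    }
}
```

Calling usedHeapFraction() every few thousand nodes inside the loop would show whether occupancy keeps climbing while a single transaction stays open.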

3. Optimization approach

Commit the transaction in batches instead of holding a single transaction open for the whole build.

4. Optimized code

    /**
     * @param
     * @return
     * @Description: Build the index and return a message (automatic index updates are not supported)
     */
    private String chineseFulltextIndex(String indexName, String labelName, List<String> propKeys) {

        Label label = Label.label(labelName);

        int nodesSize = 0;
        int propertiesSize = 0;

        // Find all nodes that carry this label
        ResourceIterator<Node> nodes = db.findNodes(label);
        Transaction tx = db.beginTx();
        try {
            int batch = 0;
            long startTime = System.nanoTime();
            while (nodes.hasNext()) {
                nodesSize++;
                Node node = nodes.next();

                boolean indexed = false;
                // Properties on each node that need to be indexed
                Set<Map.Entry<String, Object>> properties = node.getProperties(propKeys.toArray(new String[0])).entrySet();

                // If the index already exists, remove this node's old entries from it
                if (db.index().existsForNodes(indexName)) {
                    Index<Node> oldIndex = db.index().forNodes(indexName);
                    oldIndex.remove(node);
                }

                // Add a full-text index for each attribute of the node that needs to be indexed
                Index<Node> nodeIndex = db.index().forNodes(indexName, FULL_INDEX_CONFIG);
                for (Map.Entry<String, Object> property : properties) {
                    indexed = true;
                    propertiesSize++;
                    nodeIndex.add(node, property.getKey(), property.getValue());
                }
                // Commit the transaction in batches of 50,000 indexed nodes
                if (indexed) {
                    if (++batch == 50_000) {
                        batch = 0;
                        tx.success();
                        tx.close();
                        tx = db.beginTx();

                        // Log the time taken by this batch and restart the timer
                        startTime = indexConsumeTime(startTime, nodesSize, propertiesSize);
                    }
                }
            }
            tx.success();
            // Log the time taken by the final, partial batch
            indexConsumeTime(startTime, nodesSize, propertiesSize);
        } finally {
            // Release the resources held by the node iterator
            nodes.close();
            tx.close();
        }

        String message = "IndexName:" + indexName + ",LabelName:" + labelName + ",NodesSize:" + nodesSize + ",PropertiesSize:" + propertiesSize;
        return message;
    }
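The post does not show the indexConsumeTime helper that the optimized code calls. A minimal sketch of what it might look like follows, assuming it prints one line in the same shape as the "Build index-..." log lines in this post and returns a fresh start time. The class name IndexTimer and the formatProgress method are hypothetical, and the helper is written as a static method here only so that the example is self-contained.

```java
import java.util.concurrent.TimeUnit;

class IndexTimer {
    /** Formats one progress line in the same shape as the log output in this post. */
    static String formatProgress(int nodesSize, int propertiesSize, long elapsedMillis) {
        return "Build index-nodeSize:" + nodesSize
                + ",propertieSize:" + propertiesSize
                + ",consume:" + elapsedMillis + "ms";
    }

    /**
     * Prints the time elapsed since startTime (a System.nanoTime() value)
     * and returns a new start time for the next batch.
     */
    static long indexConsumeTime(long startTime, int nodesSize, int propertiesSize) {
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startTime);
        System.out.println(formatProgress(nodesSize, propertiesSize, elapsedMillis));
        return System.nanoTime();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        // Totals are passed in as running counts, as in the indexing loop
        indexConsumeTime(start, 50_000, 148_777);
    }
}
```

Using System.nanoTime() rather than wall-clock time keeps the measurement unaffected by system clock adjustments during a build that runs for hours.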

5. Performance test

The transaction is committed every 50,000 indexed nodes. In the log below, nodeSize and propertieSize are running totals, and consume is the time taken by each batch.
The first batches are slow, at about 12 s to 21 s each; they also index about three properties per node, against one per node later. After that, the time per batch settles at roughly 2 s to 5 s per 50,000 nodes. For 1 billion nodes, the estimated build time is between 11 h and 23 h.
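The arithmetic behind an estimate like this can be sketched as follows. BuildEstimate and estimateHours are hypothetical names, and the sketch assumes a constant time per batch, which the later log output shows does not hold for the whole run.

```java
class BuildEstimate {
    /** Hours needed to index totalNodes at a steady secondsPerBatch per batchSize nodes. */
    static double estimateHours(long totalNodes, int batchSize, double secondsPerBatch) {
        // Number of commits, counting a final partial batch as one
        double batches = Math.ceil((double) totalNodes / batchSize);
        return batches * secondsPerBatch / 3600.0;
    }

    public static void main(String[] args) {
        long totalNodes = 1_000_000_000L; // the 1 billion figure used in the estimate
        System.out.printf("at 2 s/batch: %.1f h%n", estimateHours(totalNodes, 50_000, 2.0));
        System.out.printf("at 5 s/batch: %.1f h%n", estimateHours(totalNodes, 50_000, 5.0));
    }
}
```

One billion nodes is 20,000 batches of 50,000, so a steady 2 s per batch comes to about 11 h and a steady 5 s per batch to about 28 h. The 23 h upper figure corresponds to an average of a little over 4 s per batch.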

Build index-nodeSize:50000,propertieSize:148777,consume:21434ms
Build index-nodeSize:100000,propertieSize:297883,consume:18493ms
Build index-nodeSize:150000,propertieSize:446936,consume:17140ms
Build index-nodeSize:200000,propertieSize:595981,consume:17323ms
Build index-nodeSize:250000,propertieSize:745039,consume:19680ms
Build index-nodeSize:300000,propertieSize:894026,consume:18451ms
Build index-nodeSize:350000,propertieSize:1042994,consume:20266ms
Build index-nodeSize:400000,propertieSize:1160186,consume:12787ms
Build index-nodeSize:450000,propertieSize:1210186,consume:1946ms
Build index-nodeSize:500000,propertieSize:1260186,consume:3174ms
Build index-nodeSize:550000,propertieSize:1310186,consume:3090ms
Build index-nodeSize:600000,propertieSize:1360186,consume:3063ms
Build index-nodeSize:650000,propertieSize:1410186,consume:1868ms
Build index-nodeSize:700000,propertieSize:1460186,consume:2036ms
Build index-nodeSize:750000,propertieSize:1510186,consume:3784ms
Build index-nodeSize:800000,propertieSize:1560186,consume:3037ms
Build index-nodeSize:850000,propertieSize:1610186,consume:2627ms
Build index-nodeSize:900000,propertieSize:1660186,consume:1900ms
Build index-nodeSize:950000,propertieSize:1710186,consume:2944ms
Build index-nodeSize:1000000,propertieSize:1760186,consume:3369ms
Build index-nodeSize:1050000,propertieSize:1810186,consume:3289ms
Build index-nodeSize:1100000,propertieSize:1860186,consume:2763ms
Build index-nodeSize:1150000,propertieSize:1910186,consume:3237ms
Build index-nodeSize:1200000,propertieSize:1960186,consume:3408ms
Build index-nodeSize:1250000,propertieSize:2010186,consume:3644ms
Build index-nodeSize:1300000,propertieSize:2060186,consume:3661ms
Build index-nodeSize:1350000,propertieSize:2110186,consume:2964ms
Build index-nodeSize:1400000,propertieSize:2160186,consume:3219ms
Build index-nodeSize:1450000,propertieSize:2210186,consume:3356ms
Build index-nodeSize:1500000,propertieSize:2260186,consume:4115ms
Build index-nodeSize:1550000,propertieSize:2310186,consume:3188ms
Build index-nodeSize:1600000,propertieSize:2360186,consume:3364ms
Build index-nodeSize:1650000,propertieSize:2410186,consume:3799ms
Build index-nodeSize:1700000,propertieSize:2460186,consume:4301ms
Build index-nodeSize:1750000,propertieSize:2510186,consume:3772ms
Build index-nodeSize:1800000,propertieSize:2560186,consume:3692ms
Build index-nodeSize:1850000,propertieSize:2610186,consume:3428ms
Build index-nodeSize:1900000,propertieSize:2660186,consume:2930ms

Note: two hours into the index build on this dataset, 14.95 million nodes had been indexed. By then the build had slowed down significantly, with single batches taking several minutes, so further optimization is needed.

Build index-nodeSize:13850000,propertieSize:14610186,consume:97290ms
Build index-nodeSize:13900000,propertieSize:14660186,consume:7441ms
Build index-nodeSize:13950000,propertieSize:14710186,consume:3730ms
Build index-nodeSize:14000000,propertieSize:14760186,consume:3512ms
Build index-nodeSize:14050000,propertieSize:14810186,consume:4545ms
Build index-nodeSize:14100000,propertieSize:14860186,consume:12100ms
Build index-nodeSize:14150000,propertieSize:14910186,consume:83071ms
Build index-nodeSize:14200000,propertieSize:14960186,consume:7417ms
Build index-nodeSize:14250000,propertieSize:15010186,consume:3579ms
Build index-nodeSize:14300000,propertieSize:15060186,consume:64841ms
Build index-nodeSize:14350000,propertieSize:15110186,consume:7553ms
Build index-nodeSize:14400000,propertieSize:15160186,consume:63141ms
Build index-nodeSize:14450000,propertieSize:15210186,consume:64316ms
Build index-nodeSize:14500000,propertieSize:15260186,consume:187510ms
Build index-nodeSize:14550000,propertieSize:15310186,consume:247571ms
Build index-nodeSize:14600000,propertieSize:15360186,consume:224611ms
Build index-nodeSize:14650000,propertieSize:15410186,consume:244539ms
Build index-nodeSize:14700000,propertieSize:15460186,consume:354684ms
Build index-nodeSize:14750000,propertieSize:15510186,consume:236970ms
Build index-nodeSize:14800000,propertieSize:15560186,consume:308532ms
Build index-nodeSize:14850000,propertieSize:15610186,consume:429815ms
Build index-nodeSize:14900000,propertieSize:15660186,consume:409451ms
Build index-nodeSize:14950000,propertieSize:15710186,consume:456980ms
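To put a number on the slowdown, the log lines can be parsed and converted to throughput. The sketch below assumes the exact line format shown in this post; BuildLog, consumeMillis and nodesPerSecond are hypothetical names introduced for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BuildLog {
    private static final Pattern LINE = Pattern.compile(
            "Build index-nodeSize:(\\d+),propertieSize:(\\d+),consume:(\\d+)ms");

    /** Returns the batch time in milliseconds, or -1 if the line does not match. */
    static long consumeMillis(String line) {
        Matcher m = LINE.matcher(line.trim());
        return m.matches() ? Long.parseLong(m.group(3)) : -1L;
    }

    /** Nodes indexed per second for one batch of batchSize nodes. */
    static double nodesPerSecond(int batchSize, long consumeMillis) {
        return batchSize * 1000.0 / consumeMillis;
    }

    public static void main(String[] args) {
        String fast = "Build index-nodeSize:900000,propertieSize:1660186,consume:1900ms";
        String slow = "Build index-nodeSize:14950000,propertieSize:15710186,consume:456980ms";
        System.out.println(nodesPerSecond(50_000, consumeMillis(fast)));
        System.out.println(nodesPerSecond(50_000, consumeMillis(slow)));
    }
}
```

For the two sample lines, throughput falls from roughly 26,000 nodes per second in the 1900 ms batch to roughly 109 nodes per second in the 456980 ms batch.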
