Question

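I have tried several ways of uploading medium-sized data sets with py2neo. In my case, about 80 K nodes and 400 K edges need to be loaded every day. I would like to share my experience and ask the community whether there is a better way that I have not come across.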

A. py2neo "native" commands

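Create nodes with graph.merge_one() and set properties with push(). I dismissed this option quickly because it was very slow: it did not get past 10 K records in several minutes. Not surprisingly, py2neo's documentation and some posts here recommend Cypher instead.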

B. Cypher without partitioning

Call append() on a py2neo.cypher.CypherTransaction in a loop, then commit() once at the end.

# query sent to MSSQL. Returns ~ 80K records
result = engine.execute(query) 
statement = "MERGE (e:Entity {myid: {ID}}) SET e.p = 1"
# begin new Cypher transaction
tx = neoGraph.cypher.begin()
for row in result:
    tx.append(statement, {"ID": row.id_field})
tx.commit()

This times out and crashes the Neo4j server. As I understand it, the problem is that all 80 K Cypher statements are executed in one go.

C. Cypher with partitioning and one commit

I use a counter and the process() command to run 1000 statements at a time.

# query sent to MSSQL. Returns ~ 80K records
result = engine.execute(query) 
statement = "MERGE (e:Entity {myid: {ID}}) SET e.p = 1"
counter = 0
tx = neoGraph.cypher.begin()
for row in result:
    counter += 1
    tx.append(statement, {"ID": row.id_field})
    if (counter == 1000):
        tx.process()    # process 1000 statements
        counter = 0
tx.commit()

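This runs quickly at the beginning but slows down as more batches of 1000 statements are processed. Eventually it times out with a stack overflow. This was surprising, because I expected process() to reset the stack each time.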

D. Cypher with partitioning and a commit for each partition

This is the only version that worked well. Call commit() for each partition of 1000 statements, then open a new transaction with begin().

# query sent to MSSQL. Returns ~ 80K records
result = engine.execute(query) 
statement = "MERGE (e:Entity {myid: {ID}}) SET e.p = 1"
counter = 0
tx = neoGraph.cypher.begin()
for row in result:
    counter += 1
    tx.append(statement, {"ID": row.id_field})
    if (counter == 1000):
        tx.commit()                   # commit 1000 statements
        tx = neoGraph.cypher.begin()  # reopen transaction
        counter = 0
tx.commit()

This approach runs quickly.

Any comments?

Answer 1

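As you have discovered by trial and error, a single transaction performs best when it contains no more than about 10K-50K operations. The approach you describe in D works best because you commit the transaction every 1000 statements. You can probably increase the batch size safely.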
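To make the batch size easy to experiment with, approach D can be wrapped in a small function. This is only a sketch: the function name load_in_batches and the default of 5000 are my own choices, and it assumes the same graph.cypher.begin() / append() / commit() interface used in the question. It opens a transaction only when there is something to send, so no empty transaction is committed at the end.

```python
def load_in_batches(graph, ids, batch_size=5000):
    """Commit one transaction per batch_size statements (approach D).

    `graph` is assumed to expose graph.cypher.begin(), as in the question.
    Returns the number of statements committed.
    """
    # Parameter syntax {ID} is copied from the question.
    statement = "MERGE (e:Entity {myid: {ID}}) SET e.p = 1"
    tx = None
    pending = 0
    total = 0
    for ident in ids:
        if tx is None:
            # Open a transaction lazily, only when a row is waiting.
            tx = graph.cypher.begin()
        tx.append(statement, {"ID": ident})
        pending += 1
        if pending >= batch_size:
            tx.commit()            # commit this batch
            total += pending
            tx, pending = None, 0  # next row starts a fresh transaction
    if tx is not None:
        tx.commit()                # commit the final, partial batch
        total += pending
    return total
```

With the code from the question, this would be called as load_in_batches(neoGraph, (row.id_field for row in result), batch_size=10000), and the batch size can then be tuned against timings on your own data.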

Another approach you might want to try is to pass an array of values as a parameter and iterate over it with Cypher's UNWIND clause. For example:

WITH {id_array} AS ids // something like [1,2,3,4,5,6]
UNWIND ids AS ident
MERGE (e:Entity {myid: ident})
SET e.p = 1
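From Python, the UNWIND query could be driven roughly as follows. Again this is a sketch under assumptions: chunked and load_with_unwind are names I made up, the batch size of 10000 is a guess to be tuned, and the transaction calls are the same ones used in the question. Each batch becomes a single statement carrying a list parameter, instead of one statement per row.

```python
from itertools import islice

# Same query as above, with the id list passed as the id_array parameter.
UNWIND_STATEMENT = (
    "WITH {id_array} AS ids "
    "UNWIND ids AS ident "
    "MERGE (e:Entity {myid: ident}) "
    "SET e.p = 1"
)

def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def load_with_unwind(graph, ids, batch_size=10000):
    """Send one UNWIND statement per batch, each in its own transaction.

    `graph` is assumed to expose graph.cypher.begin(), as in the question.
    Returns the number of ids sent.
    """
    total = 0
    for batch in chunked(ids, batch_size):
        tx = graph.cypher.begin()
        tx.append(UNWIND_STATEMENT, {"id_array": batch})
        tx.commit()
        total += len(batch)
    return total
```

The {id_array} placeholder follows the parameter syntax used in this thread; more recent Neo4j versions write parameters as $id_array instead.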
