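I'm trying to export a database to a file and import it again without copying the actual database files or stopping the database. I realise there are a number of excellent (and performant) neo4j-shell-tools, but the Neo4j database is remote, and the export-* and import-* commands require the files to reside on the remote machine, whereas in my case they reside locally.
The following post explains alternative methods for exporting/importing data, but the import is not particularly fast.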
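The following example uses a subset of our datastore containing 10,000 nodes with various labels/properties, for testing purposes. First, the database was exported via
> time cypher-shell 'CALL apoc.export.cypher.all("/tmp/graph.db.cypher", {batchSize: 1000, format: "cypher-shell", separateFiles: true})'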
real 0m1.703s
and then wiped,
neo4j stop
rm -rf /var/log/neo4j/data/databases/graph.db
neo4j start
prior to being re-imported,
time cypher-shell < /tmp/graph.db.nodes.cypher
real 0m39.105s
which doesn't seem very fast. I also tried the Python route, exporting the Cypher in plain format:
CALL apoc.export.cypher.all("/tmp/graph.db.cypher", {format: "plain", separateFiles: true})
The following snippet runs in ~30s (with a batch size of 1,000),
from itertools import izip_longest
from neo4j.v1 import GraphDatabase

with GraphDatabase.driver('bolt://localhost:7687') as driver:
    with driver.session() as session:
        with open('/tmp/graph.db.nodes.cypher') as file:
            for chunk in izip_longest(*[file] * 1000):
                with session.begin_transaction() as tx:
                    for line in chunk:
                        if line:
                            tx.run(line)
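The `izip_longest(*[file] * 1000)` idiom is what does the batching here, and it is easy to misread. Below is a minimal sketch of the same idea for Python 3, where the function is named `zip_longest`. The helper name `chunks` and the sample input are made up for illustration and are not part of the code above.

```python
from itertools import zip_longest


def chunks(iterable, size):
    """Yield lists of up to `size` consecutive items from `iterable`."""
    iterator = iter(iterable)
    # The list holds the *same* iterator `size` times, so zip_longest pulls
    # one item per slot and each tuple ends up with consecutive items.
    for group in zip_longest(*[iterator] * size):
        # The last tuple is padded with None; drop the padding.
        yield [item for item in group if item is not None]


# Hypothetical input standing in for lines read from the exported file.
batches = list(chunks(['a;', 'b;', 'c;', 'd;', 'e;'], 2))
print(batches)  # [['a;', 'b;'], ['c;', 'd;'], ['e;']]
```

The `if line:` check in the snippet above plays the same role as the `None` filter here: it skips the padding in the final, shorter batch.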
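I realise that parameterised Cypher queries are more efficient, so I used some rather kludgy logic (note that the string replacement is not sufficient for all cases) to try to extract the labels and properties from the Cypher code (this runs in ~8s):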
from itertools import izip_longest
import json
from neo4j.v1 import GraphDatabase
import re

def decode(statement):
    m = re.match(r'CREATE \((.*?)\s(.*?)\);', statement)
    labels = m.group(1).replace('`', '').split(':')[1:]
    properties = json.loads(m.group(2).replace('`', '"'))  # kludgy
    return labels, properties

with GraphDatabase.driver('bolt://localhost:7687') as driver:
    with driver.session() as session:
        with open('/tmp/graph.db.nodes.cypher') as file:
            for chunk in izip_longest(*[file] * 1000):
                with session.begin_transaction() as tx:
                    for line in chunk:
                        if line:
                            labels, properties = decode(line)
                            tx.run(
                                'CALL apoc.create.node({labels}, {properties})',
                                labels=labels,
                                properties=properties,
                            )
Using UNWIND rather than a transaction per batch further improves performance to ~5s:
# Reuses the imports and decode() from the previous snippet.
with GraphDatabase.driver('bolt://localhost:7687') as driver:
    with driver.session() as session:
        with open('/tmp/graph.db.nodes.cypher') as file:
            for chunk in izip_longest(*[file] * 1000):
                rows = []
                for line in chunk:
                    if line:
                        labels, properties = decode(line)
                        rows.append({'labels': labels, 'properties': properties})
                # One query per batch of up to 1,000 rows.
                session.run(
                    """
                    UNWIND {rows} AS row
                    WITH row.labels AS labels, row.properties AS properties
                    CALL apoc.create.node(labels, properties) YIELD node
                    RETURN true
                    """,
                    rows=rows,
                )
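For reference, the same batching pattern can be written as a small function that takes the session as an argument, which makes it testable without a running database. This is only a sketch under some assumptions: it uses the `$rows` parameter syntax, which replaced `{rows}` in later Neo4j releases, and the names `import_rows` and `QUERY` are hypothetical. It has not been benchmarked against the timings above.

```python
from itertools import islice

# Assumption: a Neo4j version that accepts `$param` syntax; on servers that
# only understand the older form, `{rows}` would be needed instead.
QUERY = """
UNWIND $rows AS row
WITH row.labels AS labels, row.properties AS properties
CALL apoc.create.node(labels, properties) YIELD node
RETURN count(*) AS created
"""


def import_rows(session, rows, batch_size=1000):
    """Send `rows` to the server in batches, one UNWIND query per batch.

    `session` is any object with a `run(query, **params)` method, such as
    a driver session. Returns the number of rows sent.
    """
    iterator = iter(rows)
    sent = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        session.run(QUERY, rows=batch)
        sent += len(batch)
    return sent
```

With a real driver this would be called as `import_rows(session, rows)` inside the `with driver.session() as session:` block, where `rows` is the list of `{'labels': ..., 'properties': ...}` dictionaries built from `decode()`.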
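Is this the correct way to speed up a Cypher import? Ideally I'd rather not do this level of manipulation in Python, since it is error-prone and I would have to do something similar for relationships.
Also, does anyone know the correct way to decode Cypher in order to extract the properties? This method fails if a property contains a backtick (`). Note that I don't want to go down the GraphML route, as I also need the schema, which is exported via the Cypher format. It does feel strange to be unpacking the Cypher in this way, though.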
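The backtick failure can be reproduced without a database. The sketch below repeats the `decode` function from above (ported to Python 3) and runs it on two hand-written statements. Both statements are hypothetical examples in the general shape of the export, not lines taken from the real file.

```python
import json
import re


def decode(statement):
    """Same kludgy approach as above: strip backticks, parse the rest as JSON."""
    m = re.match(r'CREATE \((.*?)\s(.*?)\);', statement)
    labels = m.group(1).replace('`', '').split(':')[1:]
    properties = json.loads(m.group(2).replace('`', '"'))
    return labels, properties


# Hypothetical statement with no backticks inside the property values.
ok = 'CREATE (:`Person`:`Employee` {`name`:"Alice", `age`:42});'
print(decode(ok))  # (['Person', 'Employee'], {'name': 'Alice', 'age': 42})

# Hypothetical statement whose string value contains backticks: they become
# double quotes, which terminates the JSON string early.
bad = 'CREATE (:`Person` {`note`:"a `quoted` word"});'
try:
    decode(bad)
except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
    print('decode failed:', exc)
```

The second case fails because the blanket replacement cannot tell the backticks that quote identifiers from the ones inside string values.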
Finally, for reference, the import-binary shell command takes ~3s to perform the same import:
> neo4j-shell -c "import-binary -b 1000 -i /tmp/graph.db.bin"
...
finish after 10000 row(s) 10. 100%: nodes = 10000 rels = 0 properties = 106289 time 3 ms total 3221 ms