I have a dataset of roughly 100 GB, split across 50 files, that I want to import into an empty neo4j-3.5.5 Enterprise instance on an AWS server with 32 GB RAM, configured with a 16 GB heap and an 8 GB page cache. The data is in JSON Lines format and is fed in from Python with the following query:
WITH {list} as list
UNWIND list as data
MERGE (p:LabelA {id: data.id})
SET p.prop1 = data.prop1, p.prop2 = data.prop2, p.prop3 = data.prop3, p.prop4 = data.prop4,
p.prop5 = data.prop5, p.prop6 = data.prop6, p.prop7 = data.prop7, p.prop8 = data.prop8
MERGE (j:LabelF {name: data.prop9})
MERGE (p) -[:RELATION_A {prop1: data.prop10, prop2: data.prop11}]-> (j)
MERGE (v:LabelC {name: data.prop12})
MERGE (p) -[:RELATION_B]-> (v)
FOREACH (elem IN data.prop13 |
MERGE (a:LabelB {id: elem.id}) ON CREATE
SET a.name = elem.name
MERGE (a) -[:RELATION_C]-> (p)
)
FOREACH (elem IN data.prop14 |
MERGE (u:LabelF {name: elem.name})
MERGE (u) -[:RELATION_C]-> (p)
)
FOREACH (elem IN data.prop15 |
MERGE (e:LabelD {name: elem})
MERGE (p) -[:RELATION_D]-> (e)
)
FOREACH (elem IN data.prop16 |
MERGE (out:LabelA {id: elem})
MERGE (p) -[:RELATION_E]-> (out)
)
FOREACH (elem IN data.prop17 |
MERGE (inc:LabelA {id: elem})
MERGE (p) <-[:RELATION_E]- (inc)
)
FOREACH (elem IN data.prop18 |
MERGE (pdf:LabelG {name: elem})
MERGE (p) -[:RELATION_F]-> (pdf)
)
FOREACH (elem IN data.prop19 |
MERGE (s:LabelE {name: elem})
MERGE (p) -[:RELATION_G]-> (s)
)
list contains 200 JSON lines, and each query runs in its own transaction.
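For context, the Python side presumably looks roughly like this (a minimal sketch assuming py2neo, which the self.graph.run(...) calls further down suggest; the URI, credentials, and file handling are my stand-ins, not the original code):

import json
from py2neo import Graph  # assumed driver, based on the graph.run(...) calls

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))  # placeholder connection details

IMPORT_QUERY = "..."  # the import query shown above

def import_file(path, batch_size=200):
    # send the JSON lines in batches of 200; each run() call is its own transaction
    batch = []
    with open(path) as f:
        for line in f:
            batch.append(json.loads(line))
            if len(batch) == batch_size:
                graph.run(IMPORT_QUERY, list=batch)
                batch = []
    if batch:
        graph.run(IMPORT_QUERY, list=batch)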
Indexes and constraints are set up before the data import:
self.graph.run('CREATE INDEX ON :LabelB(name)')
self.graph.run('CREATE CONSTRAINT ON (p:LabelA) ASSERT (p.id) IS NODE KEY;')
self.graph.run('CREATE CONSTRAINT ON (p:LabelB) ASSERT (p.id) IS NODE KEY;')
for label in ['LabelC', 'LabelD', 'LabelE', 'LabelF', 'LabelG']:  # deduplicated; 'LabelF' appeared twice, and re-creating an existing constraint fails in 3.5
self.graph.run(f'CREATE CONSTRAINT ON (p:{label}) ASSERT (p.name) IS NODE KEY;')
The first few checkpoints are still reasonably fast(?):
2019-05-23 15:49:10.141+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 45 checkpoint completed in 134ms
2019-05-23 16:04:45.515+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 1603 checkpoint completed in 35s 345ms
2019-05-23 16:22:38.299+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 3253 checkpoint completed in 2m 52s 483ms
But at some point, each checkpoint takes around 20-25 minutes (this is from an earlier attempt):
2019-05-23 07:40:03.755 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 18240 checkpoint started...
2019-05-23 07:42:15.942 INFO [o.n.k.NeoStoreDataSource] Rotated to transaction log [/var/lib/neo4j/data/databases/graph.db/neostore.transaction.db.144] version=144, last transaction in previous log=18253
2019-05-23 07:45:51.982 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=224, gcTime=240, gcCount=1}
2019-05-23 07:46:42.059 INFO [o.n.k.i.s.c.CountsTracker] Rotated counts store at transaction 18279 to [/data/databases/graph.db/neostore.counts.db.a], from [/data/databases/graph.db/neostore.counts.db.b].
2019-05-23 07:53:49.108 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=158, gcTime=157, gcCount=1}
2019-05-23 08:03:11.556 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 18240 checkpoint completed in 23m 7s 800ms
2019-05-23 08:03:11.710 INFO [o.n.k.i.t.l.p.LogPruningImpl] Pruned log versions 140-141, last checkpoint was made in version 143
2019-05-23 08:04:38.454 INFO [o.n.k.NeoStoreDataSource] Rotated to transaction log [/var/lib/neo4j/data/databases/graph.db/neostore.transaction.db.145] version=145, last transaction in previous log=18377
2019-05-23 08:05:57.288 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=248, gcTime=253, gcCount=1}
2019-05-23 08:11:08.331 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=143, gcTime=224, gcCount=1}
2019-05-23 08:16:37.491 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=228, gcTime=237, gcCount=1}
2019-05-23 08:18:11.732 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 18471 checkpoint started...
2019-05-23 08:23:18.767 INFO [o.n.k.NeoStoreDataSource] Rotated to transaction log [/var/lib/neo4j/data/databases/graph.db/neostore.transaction.db.146] version=146, last transaction in previous log=18496
2019-05-23 08:24:55.141 INFO [o.n.k.i.s.c.CountsTracker] Rotated counts store at transaction 18505 to [/data/databases/graph.db/neostore.counts.db.b], from [/data/databases/graph.db/neostore.counts.db.a].
2019-05-23 08:38:21.660 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=136, gcTime=195, gcCount=1}
2019-05-23 08:40:46.261 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 18471 checkpoint completed in 22m 34s 529ms
2019-05-23 08:40:46.281 INFO [o.n.k.i.t.l.p.LogPruningImpl] Pruned log versions 142-143, last checkpoint was made in version 145
Can someone tell me what is going on here? I have tried modifying the transaction-retention and log-size settings (both increasing and decreasing them) to no avail, and I have also run this on a 64 GB AWS server with a 24 GB heap and a 24 GB page cache. In every case the time needed to complete a checkpoint kept growing. While the first two files took less than two hours each, I aborted the import of the third file after six hours, because it would chug along for 15 minutes (the default interval between checkpoints) and then hang for 25 minutes in a checkpoint.
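For reference, these are presumably the neo4j.conf settings involved (a sketch; the values shown are examples, not necessarily the ones I tried):

# memory configuration described above
dbms.memory.heap.initial_size=16g
dbms.memory.heap.max_size=16g
dbms.memory.pagecache.size=8g
# transaction log retention and rotation size (the settings I varied)
dbms.tx_log.rotation.retention_policy=2 days
dbms.tx_log.rotation.size=250M
# checkpoint scheduling (defaults: every 15m or 100000 transactions)
dbms.checkpoint.interval.time=15m
dbms.checkpoint.interval.tx=100000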
Update 2019-05-24 17:35 UTC+2
I tried to implement the first part of cybersam's solution by using CREATE for relationships that are guaranteed to occur only once in the 100 GB, and I postponed creating RELATION_E to a later stage with preprocessed input files. This results in the following import query:
WITH {list} as list
UNWIND list as data
MERGE (p:LabelA {id: data.id})
SET p.prop1 = data.prop1, p.prop2 = data.prop2, p.prop3 = data.prop3, p.prop4 = data.prop4,
p.prop5 = data.prop5, p.prop6 = data.prop6, p.prop7 = data.prop7, p.prop8 = data.prop8
MERGE (j:LabelF {name: data.prop9})
CREATE (p) -[:RELATION_A {prop1: data.prop10, prop2: data.prop11}]-> (j)
MERGE (v:LabelC {name: data.prop12})
CREATE (p) -[:RELATION_B]-> (v)
FOREACH (elem IN data.prop13 |
MERGE (a:LabelB {id: elem.id}) ON CREATE
SET a.name = elem.name
CREATE (a) -[:RELATION_C]-> (p)
)
FOREACH (elem IN data.prop14 |
MERGE (u:LabelF {name: elem.name})
CREATE (u) -[:RELATION_C]-> (p)
)
FOREACH (elem IN data.prop15 |
MERGE (e:LabelD {name: elem})
CREATE (p) -[:RELATION_D]-> (e)
)
FOREACH (elem IN data.prop18 |
MERGE (pdf:LabelG {name: elem})
CREATE (p) -[:RELATION_F]-> (pdf)
)
FOREACH (elem IN data.prop19 |
MERGE (s:LabelE {name: elem})
CREATE (p) -[:RELATION_G]-> (s)
)
I then stopped Neo4j, deleted the graph.db directory, changed the configuration to checkpoint every 15 seconds (so that I could quickly tell whether the checkpoint times still increase), and started the data import. Unfortunately, the times do still increase:
2019-05-24 15:25:40.718+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 59 checkpoint completed in 240ms
2019-05-24 15:26:02.003+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 86 checkpoint completed in 1s 283ms
2019-05-24 15:26:27.518+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 105 checkpoint completed in 5s 514ms
2019-05-24 15:26:55.079+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 141 checkpoint completed in 7s 559ms
2019-05-24 15:27:23.944+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 179 checkpoint completed in 8s 864ms
2019-05-24 15:27:59.768+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 218 checkpoint completed in 15s 823ms
2019-05-24 15:28:42.819+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 269 checkpoint completed in 23s 9ms
2019-05-24 15:29:33.318+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 328 checkpoint completed in 30s 498ms
2019-05-24 15:30:32.847+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 397 checkpoint completed in 39s 489ms
2019-05-24 15:31:41.918+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 480 checkpoint completed in 49s 30ms
2019-05-24 15:33:03.113+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 576 checkpoint completed in 1m 1s 194ms
Is there an index missing somewhere?
Update 2019-05-28 18:44 UTC+2
I created a parameter set of 100 lines and imported it into an empty Neo4j with PROFILE. The query plan looks like this: [query plan screenshot]
Answer (score: 1)
The cost of each MERGE of a relationship grows linearly with the number of relationships that have to be scanned. Node indexes make finding the relationship's endpoints fast, but neo4j still has to scan the relationships of one of those endpoints so that MERGE knows whether the desired relationship already exists.
Therefore, your query's execution time will grow with the number of relationships owned by the nodes needed as relationship endpoints (and I suspect that is what is happening as you execute the query over and over).
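You can see this by profiling a single relationship MERGE between two already-bound nodes (an illustrative sketch; the id and name values are made up):

PROFILE
MATCH (p:LabelA {id: 42})
MATCH (j:LabelF {name: 'example'})
MERGE (p)-[:RELATION_A]->(j)

The resulting plan typically contains an Expand/Filter step over one endpoint's existing RELATION_A relationships, and the rows flowing through that step grow with the node's degree, so the same MERGE gets slower and slower as the import proceeds.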
Here is a two-step process for working around this.
Step 1: Use a query that uses MERGE only to create the nodes (without any relationships). In this case you can keep using MERGE, since you are only dealing with indexed nodes. For example:
UNWIND $list as data
MERGE (p:LabelA {id: data.id})
ON CREATE SET
p.prop1 = data.prop1, p.prop2 = data.prop2, p.prop3 = data.prop3, p.prop4 = data.prop4,
p.prop5 = data.prop5, p.prop6 = data.prop6, p.prop7 = data.prop7, p.prop8 = data.prop8
MERGE (j:LabelF {name: data.name})
MERGE (v:LabelC {name: data.propA})
FOREACH (e IN data.propB |
MERGE (a:LabelB {id: e.id}) ON CREATE SET a.name = e.name)
FOREACH (e IN data.propC |
MERGE (:LabelF {name: e.name}))
// split per label so the labels match the relationship query below
FOREACH (e IN data.propD |
MERGE (:LabelD {name: e}))
FOREACH (e IN data.propG |
MERGE (:LabelG {name: e}))
FOREACH (e IN data.propH |
MERGE (:LabelE {name: e}))
FOREACH (e IN data.propE + data.propF |
MERGE (:LabelA {id: e}))
Step 2: Use a query that processes each relationship exactly once, so that you can use CREATE (which does not need to scan) instead of MERGE.
Note: the second step requires that no two $list parameters (used in separate query invocations) contain data that would cause the same relationship to be created. The same constraint applies within a single $list parameter as well. Generating such $list parameters is left as an exercise for you; one possible preprocessing approach is sketched below.
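For instance, you could preprocess the files and drop endpoint pairs that have already been seen (a rough sketch, assuming the set of relationship keys fits in memory; it only handles the outgoing propE side, and propF would be handled analogously with the key reversed):

import json

def dedupe_relation_e(in_paths, out_path):
    # drop outgoing RELATION_E endpoints that already appeared in any earlier line or file
    seen = set()
    with open(out_path, "w") as out:
        for path in in_paths:
            with open(path) as f:
                for line in f:
                    data = json.loads(line)
                    fresh = [t for t in data.get("propE", []) if (data["id"], t) not in seen]
                    seen.update((data["id"], t) for t in fresh)
                    data["propE"] = fresh
                    out.write(json.dumps(data) + "\n")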
Once you have appropriate $list parameters, you can do this:
// caveat: if a MATCH finds nothing or an UNWIND list is empty, the remaining
// rows for that data entry are discarded, so empty list properties need care
UNWIND $list as data
MATCH (p:LabelA {id: data.id})
MATCH (j:LabelF {name: data.name})
CREATE (p) -[:RELATION_A {prop1: data.prop1, prop2: data.prop2}]-> (j)
WITH p, data
MATCH (v:LabelC {name: data.propA})
CREATE (p) -[:RELATION_B]-> (v)
WITH p, data
UNWIND data.propB as elem
MATCH (a:LabelB {id: elem.id})
CREATE (a) -[:RELATION_C]-> (p)
WITH DISTINCT p, data
UNWIND data.propC as elem
MATCH (u:LabelF {name: elem.name})
CREATE (u) -[:RELATION_C]-> (p)
WITH DISTINCT p, data
MATCH (e:LabelD) WHERE e.name IN data.propD
CREATE (p) -[:RELATION_D]-> (e)
WITH DISTINCT p, data
MATCH (out:LabelA) WHERE out.id IN data.propE
CREATE (p) -[:RELATION_E]-> (out)
WITH DISTINCT p, data
MATCH (inc:LabelA) WHERE inc.id IN data.propF
CREATE (p) <-[:RELATION_E]- (inc)
WITH DISTINCT p, data
MATCH (pdf:LabelG) WHERE pdf.name IN data.propG
CREATE (p) -[:RELATION_F]-> (pdf)
WITH DISTINCT p, data
MATCH (s:LabelE) WHERE s.name IN data.propH
CREATE (p) -[:RELATION_G]-> (s)
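Putting the two steps together, the driver-side flow would look something like this (an illustrative sketch reusing the batching approach from the question; the file names, connection details, and query placeholders are mine, not part of the original setup):

import json
from py2neo import Graph  # assumed driver, matching the question's graph.run(...) calls

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))  # placeholder
NODE_QUERY = "..."  # paste the node-only query from step 1 here
REL_QUERY = "..."   # paste the relationship query from step 2 here

def batches(path, size=200):
    # yield the JSON lines of one file in lists of `size`
    batch = []
    with open(path) as f:
        for line in f:
            batch.append(json.loads(line))
            if len(batch) == size:
                yield batch
                batch = []
    if batch:
        yield batch

node_files = ["file-%02d.jsonl" % i for i in range(50)]   # hypothetical names
rel_files = ["dedup-%02d.jsonl" % i for i in range(50)]   # after deduplication

for path in node_files:                 # pass 1: MERGE all nodes
    for b in batches(path):
        graph.run(NODE_QUERY, list=b)

for path in rel_files:                  # pass 2: CREATE all relationships
    for b in batches(path):
        graph.run(REL_QUERY, list=b)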