Neo4j LOAD CSV query is extremely slow when creating relationships

Date: 2016-07-13 20:08:52

Tags: neo4j cypher load-csv

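I am evaluating Neo4j (Community Edition). I am importing some data (1 million rows) with LOAD CSV. Each row has to match previously imported nodes in order to create a relationship between them.

Here is my query: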

//Query #3
//create edges between Tr and Ad nodes

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///1M.txt'
AS line
 FIELDTERMINATOR '\t'

//find appropriate tx and ad
MATCH (tx:Tr { txid: TOINT(line.txid) }), (ad:Ad {p58: line.p58})

//create the edge (relationship)
CREATE (tx)-[out:OUT_TO]->(ad)

//set properties on the edge
SET out.id= TOINT(line.id)
SET out.n = TOINT(line.n)
SET out.v = TOINT(line.v)

I have the following indexes:

Indexes
  ON :Ad(p58)       ONLINE (for uniqueness constraint) 
  ON :Tr(txid)      ONLINE                             
  ON :Tr(h)         ONLINE (for uniqueness constraint)
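
For reference, indexes and constraints like the ones listed above could be created as sketched below. This assumes the Neo4j 3.x syntax (the plan further down reports Compiler CYPHER 3.0); the variable names `a` and `t` are placeholders, and the statements are a reconstruction from the listing, not the ones actually run by the poster.

```cypher
// Uniqueness constraints; each one also creates a backing index
CREATE CONSTRAINT ON (a:Ad) ASSERT a.p58 IS UNIQUE;
CREATE CONSTRAINT ON (t:Tr) ASSERT t.h IS UNIQUE;

// Plain (non-unique) index used to look up Tr nodes by txid
CREATE INDEX ON :Tr(txid);
```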

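This query has now been running for 5 days and has so far created 270K relationships (out of more than 1M). The Java heap is 4g.
The machine has 32G of RAM and an SSD for storage, and runs nothing but Linux and Neo4j.

Any hints for speeding this up would be highly appreciated. Should I try the Enterprise Edition?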

Query plan:

+--------------------------------------------+
| No data returned, and nothing was changed. |
+--------------------------------------------+
If a part of a query contains multiple disconnected patterns, 
this will build a cartesian product between all those parts.
This may produce a large amount of data and slow down query processing.
While occasionally intended, 
it may often be possible to reformulate the query that avoids the use of this cross product,
 perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (ad))
20 ms

Compiler CYPHER 3.0

Planner COST

Runtime INTERPRETED

+---------------------------------+----------------+---------------------+----------------------------+
| Operator                        | Estimated Rows | Variables           | Other                      |
+---------------------------------+----------------+---------------------+----------------------------+
| +ProduceResults                 |              1 |                     |                            |
| |                               +----------------+---------------------+----------------------------+
| +EmptyResult                    |                |                     |                            |
| |                               +----------------+---------------------+----------------------------+
| +Apply                          |              1 | line -- ad, out, tx |                            |
| |\                              +----------------+---------------------+----------------------------+
| | +SetRelationshipProperty(4)   |              1 | ad, out, tx         |                            |
| | |                             +----------------+---------------------+----------------------------+
| | +CreateRelationship           |              1 | out -- ad, tx       |                            |
| | |                             +----------------+---------------------+----------------------------+
| | +ValueHashJoin                |              1 | ad -- tx            | ad.p58; line.p58           |
| | |\                            +----------------+---------------------+----------------------------+
| | | +NodeIndexSeek              |              1 | tx                  | :Tr(txid)                  |
| | |                             +----------------+---------------------+----------------------------+
| | +NodeUniqueIndexSeek(Locking) |              1 | ad                  | :Ad(p58)                   |
| |                               +----------------+---------------------+----------------------------+
| +LoadCSV                        |              1 | line                |                            |
+---------------------------------+----------------+---------------------+----------------------------+

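1 answer:

Answer 0 (score: 1)

OK, so splitting the MATCH statement into two parts sped up the query enormously. Thanks @William Lyon for pointing me to the plan, where I noticed the warning.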

The old MATCH statement:

MATCH (tx:Tr { txid: TOINT(line.txid) }), (ad:Ad {p58: line.p58})

Split into two parts:

MATCH (tx:Tr { txid: TOINT(line.txid) })
MATCH (ad:Ad {p58: line.p58})

With this change, the query took 83 seconds for 750K relationships. Next up: a 9 million row CSV LOAD.
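
Put together, the full import query with the two separate MATCH clauses might look like the sketch below. With a single comma-separated MATCH, the plan in the question joins the two lookups with a ValueHashJoin and warns about a cartesian product; with two MATCH clauses each node is looked up on its own. The batch size of 10000 and the inline property map on CREATE are assumptions added here for illustration and are not part of the original answer.

```cypher
// Commit in batches (10000 is an assumed batch size)
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///1M.txt' AS line
FIELDTERMINATOR '\t'

// Look up each endpoint separately instead of in one comma-separated MATCH
MATCH (tx:Tr { txid: TOINT(line.txid) })
MATCH (ad:Ad { p58: line.p58 })

// Create the relationship and set its properties in one step
CREATE (tx)-[:OUT_TO { id: TOINT(line.id), n: TOINT(line.n), v: TOINT(line.v) }]->(ad)
```

Prefixing the query with EXPLAIN shows the plan without running the import, which is a cheap way to check that the cartesian product warning is gone before starting a long load.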