Neo4j data formatting slows down query

Time: 2018-04-23 12:54:52

Tags: node.js neo4j cypher

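I have a fairly long Cypher query that I need to run against a Neo4j database of about 1 million nodes. The actual query is quite simple, but I added some logic to format the data into something we can use quickly once it is returned to the API. The data integrity is also not perfect and we have some duplicate entries, so I added logic to deal with that as well. However, the query currently takes about a minute to execute, which is definitely too slow. I noticed that returning just the nodes without any formatting brings the query down to about one second, so I am fairly confident the problem lies in formatting the data and ensuring uniqueness. I thought DISTINCT might be what slows things down, but after removing it I saw no noticeable speedup. Am I making any obvious mistakes here that would clearly hurt performance?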

Original query

OPTIONAL MATCH
(p:Person)-[:WROTE]->(b:Edition)<-[:PUBLISHED]-(pub:Publisher)-[:PUBLISHES_IN]->(plc:Place)
WHERE
b.title =~ { regex }
WITH
{
    title: b.title,
    isbn: b.isbn,
    date: toString( b.date ),
    id: toString( id(b) ),
    authors: collect(
    DISTINCT {
        name: p.name,
        id: toString( id(p) ) 
    }), 
    publishers: collect(
    DISTINCT {
        name: pub.name,
        id: toString( id(pub) )
    }),
    places: collect(
    DISTINCT {
        name: plc.name,
        id: toString( id(plc) )
    }),
    relationships: {
        wrote: collect(
        DISTINCT [
            toString( id(p) ),
            toString( id(b) )
        ]),
        published: collect(
        DISTINCT [
            toString( id(pub) ),
            toString( id(b) )
        ]),
        publishes_in: collect(
        DISTINCT [
            toString( id(pub) ),
            toString( id(plc) )
        ])
    }
} as tmp
WITH
collect( DISTINCT tmp ) as records

UNWIND records as r
RETURN DISTINCT
    CASE
        WHEN (r.title IS NULL OR r.authors IS NULL OR r.publishers IS NULL) THEN NULL
        ELSE r
    END AS res
LIMIT { limit }

Query without the formatting logic

OPTIONAL MATCH
(p:Person)-[:WROTE]->(b:Edition)<-[:PUBLISHED]-(pub:Publisher)-[:PUBLISHES_IN]->(plc:Place)
WHERE
b.title =~ { regex }
RETURN p, b, pub, plc
LIMIT { limit }

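I could just return the nodes and relationships and do the rest of the data processing in JavaScript, but the Neo4j driver returns data in a somewhat messy format, and I would rather do this in Cypher if possible. Thanks in advance!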
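For comparison, the client-side alternative mentioned above could look roughly like the sketch below. It assumes each row has already been converted from the driver's record format into plain objects with string ids; that row shape and the name `groupRows` are hypothetical and not part of the original code. Maps keyed by id stand in for `collect(DISTINCT ...)`.

```javascript
// Sketch: group flat rows into one record per edition, deduplicating as we go.
// Assumed (hypothetical) row shape, NOT the raw driver format:
//   { p: {id, name}, b: {id, title, isbn, date}, pub: {id, name}, plc: {id, name} }
function groupRows(rows) {
  const editions = new Map(); // edition id -> record under construction

  for (const { p, b, pub, plc } of rows) {
    if (!b) continue; // OPTIONAL MATCH may produce a row of nulls

    if (!editions.has(b.id)) {
      editions.set(b.id, {
        title: b.title,
        isbn: b.isbn,
        date: b.date == null ? null : String(b.date),
        id: b.id,
        // Maps keyed by id (or id pair) play the role of collect(DISTINCT ...)
        authors: new Map(),
        publishers: new Map(),
        places: new Map(),
        wrote: new Map(),
        published: new Map(),
        publishesIn: new Map(),
      });
    }
    const rec = editions.get(b.id);

    rec.authors.set(p.id, { name: p.name, id: p.id });
    rec.publishers.set(pub.id, { name: pub.name, id: pub.id });
    rec.places.set(plc.id, { name: plc.name, id: plc.id });
    rec.wrote.set(`${p.id}>${b.id}`, [p.id, b.id]);
    rec.published.set(`${pub.id}>${b.id}`, [pub.id, b.id]);
    rec.publishesIn.set(`${pub.id}>${plc.id}`, [pub.id, plc.id]);
  }

  // Convert the internal Maps into plain arrays for the API response.
  return [...editions.values()].map((rec) => ({
    title: rec.title,
    isbn: rec.isbn,
    date: rec.date,
    id: rec.id,
    authors: [...rec.authors.values()],
    publishers: [...rec.publishers.values()],
    places: [...rec.places.values()],
    relationships: {
      wrote: [...rec.wrote.values()],
      published: [...rec.published.values()],
      publishes_in: [...rec.publishesIn.values()],
    },
  }));
}
```

This makes a single pass over the rows, so the cost grows with the number of returned rows rather than with repeated distinct-collection of nested maps. Whether it is actually faster than doing the work in Cypher would have to be measured.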

1 answer:

Answer 0: (score: 0)

First, you are using a regex, which cannot use an index. If possible, I suggest you use CONTAINS, STARTS WITH, or ENDS WITH instead.
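As a rough sketch of that first suggestion, assuming the regex only ever expresses a plain, case-sensitive substring match: the Node.js side could build the simpler query like this. The helper name `buildTitleQuery` and the parameter names `$term` and `$limit` are hypothetical, and the `$name` parameter style is used here in place of the `{ name }` style from the question.

```javascript
// Hypothetical helper: builds the query text and parameters for a substring
// search on Edition titles, replacing the regex predicate with CONTAINS.
// Assumption: the caller passes a plain search term, not a regex pattern.
function buildTitleQuery(term, limit) {
  const text = [
    'OPTIONAL MATCH',
    '(p:Person)-[:WROTE]->(b:Edition)<-[:PUBLISHED]-(pub:Publisher)-[:PUBLISHES_IN]->(plc:Place)',
    'WHERE',
    'b.title CONTAINS $term', // was: b.title =~ { regex }
    'RETURN p, b, pub, plc',
    'LIMIT $limit',
  ].join('\n');

  // Note: with the official JavaScript driver the limit may need to be sent
  // as an integer type (for example via neo4j.int) rather than a JS number.
  return { text, params: { term, limit } };
}

const query = buildTitleQuery('Hamlet', 25);
console.log(query.text);
```

If the original regex was case-insensitive or anchored, CONTAINS is not a drop-in replacement; STARTS WITH or ENDS WITH would cover the anchored cases.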

Your query also performs a lot of COLLECT and DISTINCT operations. These take some time and a lot of RAM.

At the end of your query you do a COLLECT and then immediately an UNWIND. That part can be simplified from:

WITH collect( DISTINCT tmp ) as records
UNWIND records as r
RETURN DISTINCT
    CASE
        WHEN (r.title IS NULL OR r.authors IS NULL OR r.publishers IS NULL) THEN NULL
        ELSE r
    END AS res

to:

WITH DISTINCT tmp as r
RETURN DISTINCT
    CASE
        WHEN (r.title IS NULL OR r.authors IS NULL OR r.publishers IS NULL) THEN NULL
        ELSE r
    END AS res

Also, the CASE expression at the end seems to be filtering the results. You should put those conditions directly in the MATCH clause.

So in the end your query should be:

OPTIONAL MATCH (p:Person)-[:WROTE]->(b:Edition)<-[:PUBLISHED]-(pub:Publisher)-[:PUBLISHES_IN]->(plc:Place)
WHERE
  b.title =~ { regex } AND 
  NOT b.title IS NULL AND
  size((b)<-[:WROTE]-()) > 0 AND
  size((b)<-[:PUBLISHED]-()) > 0

WITH
{
    title: b.title,
    isbn: b.isbn,
    date: toString( b.date ),
    id: toString( id(b) ),
    authors: collect(
    DISTINCT {
        name: p.name,
        id: toString( id(p) ) 
    }), 
    publishers: collect(
    DISTINCT {
        name: pub.name,
        id: toString( id(pub) )
    }),
    places: collect(
    DISTINCT {
        name: plc.name,
        id: toString( id(plc) )
    }),
    relationships: {
        wrote: collect(
        DISTINCT [
            toString( id(p) ),
            toString( id(b) )
        ]),
        published: collect(
        DISTINCT [
            toString( id(pub) ),
            toString( id(b) )
        ]),
        publishes_in: collect(
        DISTINCT [
            toString( id(pub) ),
            toString( id(plc) )
        ])
    }
} as record
RETURN DISTINCT record
LIMIT {limit}