Strange neo4j Cypher behaviour

Time: 2015-10-04 12:09:12

Tags: performance neo4j cypher

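I am quite new to both neo4j and the Cypher query language.

My data set of nodes and relationships basically looks like this:

  1. I have about 27000 user nodes in the database.
  2. I have about 8000 question nodes in the database.
  3. Question nodes can be answered by user nodes, so there are relationships like (user)-[:ANSWERED]->(question).
  4. Some question nodes trigger properties for users, so there are relationships like (user)-[:HAS_PROPERTY]->(property).
  5. In addition, some question nodes require certain properties before they can be answered, so there are relationships like (question)-[:REQUIRES]->(property).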
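
To make the model above concrete, here is a minimal sketch of a toy graph with the same shape. The property values, the number of nodes, and the :IS_ACTIVE relationship type (borrowed from the answer further down, since the question's own queries leave that type unspecified) are assumptions for illustration only.

```cypher
// Hypothetical miniature of the data set; all values are made up.
CREATE (u:User {code: 'xyz'}),
       (q1:Question {priority: 10}),
       (q2:Question {priority: 5}),
       (p:Property),
       (active:ActiveQuestions),
       // both questions are currently active
       (active)-[:IS_ACTIVE]->(q1),
       (active)-[:IS_ACTIVE]->(q2),
       // the user has already answered q1
       (u)-[:ANSWERED]->(q1),
       // q2 requires a property that the user has
       (u)-[:HAS_PROPERTY]->(p),
       (q2)-[:REQUIRES]->(p)
```

In this toy graph q1 is already answered, while q2 is unanswered and its only required property is held by the user, so q2 is the kind of question the query is meant to return.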

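My query is about finding the questions that a specific user has not yet answered, taking the questions' property requirements into account, limited to 50 questions.

After struggling with this for a while, I came up with the following query: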

    MATCH (user:User {code: 'xyz'}), (:ActiveQuestions)-[]->(q:Question) 
    OPTIONAL MATCH (:User {code: 'xyz'})-[a:ANSWERED]->(q) 
    WITH q, user 
    WHERE a IS NULL 
    OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property) 
    WITH q, user, count(r) as rCount 
    OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)<-[h:HAS_PROPERTY]-(user) 
    WITH q, rCount, count(h) as hCount 
    WHERE rCount = 0 or rCount = hCount 
    RETURN q ORDER BY q.priority DESC LIMIT 50
    

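The query above gives me the expected rows and is pretty fast (around 150 ms), which is great.

What I don't understand is this:

When I change the second line of the query to reuse the user variable instead of doing a label lookup, the query becomes very slow, especially for users who have already answered many or even all of the questions.

So the following query is much slower: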

    MATCH (user:User {code: 'xyz'}), (:ActiveQuestions)-[]->(q:Question) 
    OPTIONAL MATCH (user)-[a:ANSWERED]->(q) 
    WITH q, user 
    WHERE a IS NULL 
    OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property) 
    WITH q, user, count(r) as rCount 
    OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)<-[h:HAS_PROPERTY]-(user) 
    WITH q, rCount, count(h) as hCount 
    WHERE rCount = 0 or rCount = hCount 
    RETURN q ORDER BY q.priority DESC LIMIT 50
    

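Why is that? I really don't understand it. I would have expected that reusing the already matched user as the basis for the second OPTIONAL MATCH would be cheaper.

While researching Cypher performance I found a lot of articles telling me to avoid OPTIONAL MATCH where possible. So my first query looked like this: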

    MATCH (user:User {code: 'xyz'}), (:ActiveQuestions)-[]->(q:Question) 
    MATCH (q) WHERE NOT (q)<-[:ANSWERED]->(user) 
    WITH q, user 
    OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property) 
    WITH q, user, count(r) as rCount 
    OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)<-[h:HAS_PROPERTY]-(user) 
    WITH q, rCount, count(h) as hCount 
    WHERE rCount = 0 or rCount = hCount 
    RETURN q ORDER BY q.priority DESC LIMIT 50
    

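Same problem here. The query above is much slower than the first one, roughly 20-30 times slower.

Finally, I would like to ask whether I am missing something, and whether there is a better way to achieve my goal.

Any help would be greatly appreciated.

Regards

Alex

EDIT

Here are some profiling details:

Using the following query: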

    MATCH (user:User {code: 'xyz'}), (:ActiveQuestions)-[]->(q:Question) 
    OPTIONAL MATCH (:User {code: 'xyz'})-[a:ANSWERED]->(q) 
    WITH q, user 
    WHERE a IS NULL 
    OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property) 
    WITH q, user, count(r) as rCount 
    OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)<-[h:HAS_PROPERTY]-(user) 
    WITH q, rCount, count(h) as hCount 
    WHERE rCount = 0 or rCount = hCount 
    RETURN q ORDER BY q.priority DESC LIMIT 50
    
    Cypher version: CYPHER 2.2, planner: COST. 26979 total db hits in 169 ms.
    

Using the query suggested by Michael Hunger:

    MATCH (user:User {code: 'abc'})
    MATCH (:ActiveQuestions)-[]->(q:Question) 
    WHERE NOT (user)-[:ANSWERED]->(q) 
    OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property) 
    WITH q, user, count(r) as rCount 
    OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)<-[h:HAS_PROPERTY]-(user) 
    WITH q, rCount, count(h) as hCount 
    WHERE rCount = 0 or rCount = hCount 
    RETURN q ORDER BY q.priority DESC LIMIT 50
    
    Cypher version: CYPHER 2.2, planner: COST. 2337573 total db hits in 2622 ms.
    

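So my current query is faster and more efficient.

What I really don't understand, and the reason I titled this post "Strange neo4j Cypher behaviour", is what happens when I change the second line of my fast query from: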

    OPTIONAL MATCH (:User {code: 'xyz'})-[a:ANSWERED]->(q) 
    

to:

    OPTIONAL MATCH (user)-[a:ANSWERED]->(q) 
    

which seems simpler and more logical to me. I then get the following:

    MATCH (user:User {code: 'xyz'}), (:ActiveQuestions)-[]->(q:Question) 
    WHERE NOT (user)-[:ANSWERED]->(q) 
    OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property) 
    WITH q, user, count(r) as rCount 
    OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)<-[h:HAS_PROPERTY]-(user) 
    WITH q, rCount, count(h) as hCount 
    WHERE rCount = 0 or rCount = hCount 
    RETURN q ORDER BY q.priority DESC LIMIT 50
    
    Cypher version: CYPHER 2.2, planner: COST. 2337573 total db hits in 2391 ms.
    

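So I get exactly the same number of db hits as with the slow query mentioned before.

Does anyone have an explanation for this?

Also, it makes no difference when I change the first line

from: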

    MATCH (user:User {code: 'xyz'}), (:ActiveQuestions)-[]->(q:Question) 
    

to:

    MATCH (user:User {code: 'xyz'})
    MATCH (:ActiveQuestions)-[]->(q:Question)
    

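So I basically have two questions:

1. Why is the query so much slower when I reuse the already defined user node variable (user) than when I use (user:User {code: 'xyz'})?

2. My second line is the equivalent of an outer join. Contrary to all the advice I have found, this is much faster than using MATCH (q) WHERE NOT (q)<-[:ANSWERED]->(user). I assumed the latter was also doing an outer join, but that does not seem to be the case.

EDIT

After some further profiling, I came up with a somewhat cheaper query. See the following profiling details:

Using the following Cypher query: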

      MATCH (user:User {code: 'xyz'}), (:ActiveQuestions)-[]->(q) 
      OPTIONAL MATCH (:User {code: 'xyz'})-[a:ANSWERED]->(q) 
      WITH q, user 
      WHERE a IS NULL 
      OPTIONAL MATCH (q)-[r:REQUIRES]->(p) 
      WITH q, user, count(r) as rCount 
      OPTIONAL MATCH (q)-[r:REQUIRES]->(p)<-[h:HAS_PROPERTY]-(user)
      WITH q, rCount, count(h) as hCount 
      WHERE rCount = hCount 
      RETURN q ORDER BY q.priority DESC LIMIT 50
      
      Cypher version: CYPHER 2.2, planner: COST. 21669 total db hits in 120 ms.
      

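So I basically got rid of the explicit node labels (:Question) and (:Property) in the example, which sounds reasonable to me because the explicit label scans are no longer needed. This saved me about 5300 db hits.

Is there anything else in this query that could be tuned?

1 Answer:

Answer 0: (score: 1)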

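You are pulling in a lot of rows with the second match, and you have to collapse them again. If you change the first WITH to WITH DISTINCT q, user, or to an aggregation such as WITH q, user, count(*) AS answers, you bring your cardinality back down.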
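
A sketch of the aggregation variant, applied to the slow query from the question. Note one deliberate deviation from the wording above: it uses count(a) rather than count(*), because count(a) skips the null that OPTIONAL MATCH produces when nothing matches, which lets the "not yet answered" check be written as answers = 0. Whether this actually lowers the db hits on the real data would still need to be confirmed with PROFILE.

```cypher
MATCH (user:User {code: 'xyz'}), (:ActiveQuestions)-[]->(q:Question)
OPTIONAL MATCH (user)-[a:ANSWERED]->(q)
// Collapse to one row per (q, user). count(a) ignores nulls,
// so answers = 0 means the user has not answered q.
WITH q, user, count(a) AS answers
WHERE answers = 0
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)
WITH q, user, count(r) AS rCount
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)<-[h:HAS_PROPERTY]-(user)
WITH q, rCount, count(h) AS hCount
WHERE rCount = 0 OR rCount = hCount
RETURN q ORDER BY q.priority DESC LIMIT 50
```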

Also, I think (:ActiveQuestions)-[]->(q:Question) already pulls in a lot of rows.

If you run the query with PROFILE, you should see how much data is accessed.
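
As a minimal sketch of that, PROFILE can be prefixed to just the first part of the query to see how many rows and db hits the active-question match and the ANSWERED check account for on their own. Reducing the query to a count is an assumption made here to keep the plan small; it is not part of the original query.

```cypher
// PROFILE executes the statement and reports rows and db hits per operator.
PROFILE
MATCH (user:User {code: 'xyz'})
MATCH (:ActiveQuestions)-[]->(q:Question)
WHERE NOT (user)-[:ANSWERED]->(q)
RETURN count(q)
```

Comparing this plan with the one for the OPTIONAL MATCH version should show which operator is responsible for the difference in db hits.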

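In general, I would try changing the OPTIONAL MATCH into a WHERE condition and see how that goes.

By the way, you could label the active questions as :ActiveQuestion instead; no extra relationship is needed then. I also added a rel-type.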

    MATCH (user:User {code: 'xyz'})
    MATCH (:ActiveQuestions)-[:IS_ACTIVE]->(q:Question)
    WHERE NOT (user)-[:ANSWERED]->(q)
    OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)
    WITH q, user, count(r) as rCount
    OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)<-[h:HAS_PROPERTY]-(user)
    WITH q, rCount, count(h) as hCount
    WHERE rCount = 0 or rCount = hCount
    RETURN q ORDER BY q.priority DESC LIMIT 50
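
The label-based alternative mentioned above could look roughly like the sketch below. It assumes every active question already carries an additional :ActiveQuestion label (for example set once with `MATCH (:ActiveQuestions)-[]->(q:Question) SET q:ActiveQuestion`); that label does not exist in the data set described in the question, so this is untested against it.

```cypher
MATCH (user:User {code: 'xyz'})
// Assumption: active questions are marked with the :ActiveQuestion label,
// so no hop from an :ActiveQuestions node is needed.
MATCH (q:ActiveQuestion)
WHERE NOT (user)-[:ANSWERED]->(q)
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)
WITH q, user, count(r) AS rCount
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)<-[h:HAS_PROPERTY]-(user)
WITH q, rCount, count(h) AS hCount
WHERE rCount = 0 OR rCount = hCount
RETURN q ORDER BY q.priority DESC LIMIT 50
```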