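I'm working on implementing a recommendation system on top of our Neo4J graph and have started looking at the query I plan to use, but it is executing a lot slower than I would expect.
Stats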
Neo4J Version: 2.3.1
Nodes: 820K
Relationships: 7.6M
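I have done a fair amount of research into query optimization, but as far as I can tell I'm not falling into any of the common pitfalls in the structure of my query (though I'm no expert).
Here is a dev console with a test dataset: http://console.neo4j.org/r/b7jk2b
The query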
MATCH (u1:User {id: {user_id}})-[l1:LIKES]->(p1:Product)
WITH u1, l1, p1
ORDER BY p1.created_at DESC
LIMIT 10
MATCH (p1)<-[:LIKES]-(u2:User)
WHERE NOT u1=u2
WITH u1, l1, p1, u2, COUNT(u2) as rating
ORDER BY rating DESC
LIMIT 50
MATCH (u2)-[l2:LIKES]->(recommendation:Product)
WHERE NOT (p1)=(recommendation)
WITH recommendation, COUNT(recommendation) as weight
RETURN recommendation.id as id
ORDER BY weight DESC
LIMIT {limit}
Our indexes
Indexes
ON :LIKES(created_at) ONLINE
ON :Product(id) ONLINE
ON :Product(created_at) ONLINE
ON :User(id) ONLINE
ON :User(date_joined) ONLINE
No constraints
Query profile output (run against a copy of our production dataset)
+-------------------+----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Identifiers | Other |
+-------------------+----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +ProduceResults | 7 | 100 | 0 | id | id |
| | +----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +Projection | 7 | 100 | 0 | anon[382], id, recommendation, weight | anon[382] |
| | +----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +Top | 7 | 100 | 0 | anon[382], recommendation, weight | Literal(100); weight |
| | +----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +Projection | 7 | 129342 | 129342 | anon[382], recommendation, weight | recommendation.id; weight |
| | +----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +EagerAggregation | 7 | 129342 | 0 | recommendation, weight | recommendation |
| | +----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +Filter | 44 | 442432 | 471953 | l1, l2, p1, rating, recommendation, u1, u2 | Ands(NOT(p1 == recommendation), recommendation:Product) |
| | +----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +Expand(All) | 44 | 472039 | 472089 | l1, l2, p1, rating, recommendation, u1, u2 | (u2)-[l2:LIKES]->(recommendation) |
| | +----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +Top | 10 | 50 | 0 | l1, p1, rating, u1, u2 | Literal(50); rating |
| | +----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +EagerAggregation | 10 | 527 | 0 | l1, p1, rating, u1, u2 | u1, l1, p1, u2 |
| | +----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +Filter | 92 | 563 | 563 | anon[82], anon[119], l1, p1, u1, u2 | Ands(NOT(u1 == u2), u2:User) |
| | +----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +Expand(All) | 92 | 574 | 584 | anon[82], anon[119], l1, p1, u1, u2 | (p1)<-[:LIKES]-(u2) |
| | +----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +Top | 5 | 10 | 0 | anon[82], l1, p1, u1 | Literal(10); |
| | +----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +Projection | 5 | 42 | 42 | anon[82], l1, p1, u1 | u1; l1; p1; p1.created_at |
| | +----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +Filter | 5 | 42 | 413 | l1, p1, u1 | p1:Product |
| | +----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +Expand(All) | 6 | 413 | 414 | l1, p1, u1 | (u1)-[l1:LIKES]->(p1) |
| | +----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
| +NodeIndexSeek | 1 | 1 | 2 | u1 | :User(id) |
+-------------------+----------------+--------+---------+--------------------------------------------+---------------------------------------------------------+
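I have seen case studies of people using Neo4j for real-time collaborative filtering, so I assume it must be possible to get this kind of query to perform well on a dataset of this size. Am I being unrealistic? We are running this on an Amazon EC2 Compute-Optimized node (c4.large), so I would expect it to have reasonable performance.
I'm scratching my head here and would really appreciate any input.
Cheers, David.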
Answer 0 (score: 0)
[Aside: The dev console, when reopened, does not re-create indexes, so they have to be manually recreated.]
I don't know if this is good enough for you, but you can eliminate about 44% of the DB hits in your profiled results simply by not specifying the labels for most of the nodes (p1, u2, and recommendation) in your query:
MATCH (u1:User {id: {user_id}})-[l1:LIKES]->(p1)
WITH u1, l1, p1
ORDER BY p1.created_at DESC
LIMIT 10
MATCH (p1)<-[:LIKES]-(u2)
WHERE NOT u1=u2
WITH u1, l1, p1, u2, COUNT(u2) as rating
ORDER BY rating DESC
LIMIT 50
MATCH (u2)-[l2:LIKES]->(recommendation)
WHERE NOT (p1)=(recommendation)
WITH recommendation, COUNT(recommendation) as weight
RETURN recommendation.id as id
ORDER BY weight DESC
LIMIT {limit}
The label for u1 should still be specified in the query, since that allows Cypher to use the index on :User(id). In general, one should carefully evaluate a query to see when node labels can be eliminated. In your case, the p1, u2, and recommendation nodes can be found by following relationships (and, I presume, the LIKES relationship type is only used to point to Product nodes), so specifying their labels is superfluous and causes unnecessary work.
The profile results for the above query will have a DB Hits value of 0 for all the Filter steps (and in one case, the Filter step will be eliminated entirely).
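As a rough sanity check on the "about 44%" figure, the DB Hits column of the profile in the question can be summed per operator. This is only a sketch in Python: the numbers are copied from that profile table, the variable names are made up for illustration, and it assumes that all of the Filter hits go away once the labels are dropped.

```python
# DB Hits per operator, copied from the PROFILE table in the question (top to bottom).
profile = [
    ("ProduceResults", 0),
    ("Projection", 0),
    ("Top", 0),
    ("Projection", 129342),
    ("EagerAggregation", 0),
    ("Filter", 471953),
    ("Expand(All)", 472089),
    ("Top", 0),
    ("EagerAggregation", 0),
    ("Filter", 563),
    ("Expand(All)", 584),
    ("Top", 0),
    ("Projection", 42),
    ("Filter", 413),
    ("Expand(All)", 414),
    ("NodeIndexSeek", 2),
]

# Total work done by the query, and the part spent in Filter steps.
total_hits = sum(hits for _, hits in profile)
filter_hits = sum(hits for op, hits in profile if op == "Filter")

# Assumption: every Filter hit disappears once the labels are removed.
filter_share = filter_hits / total_hits

print(f"{filter_hits} of {total_hits} DB hits ({filter_share:.1%}) come from Filter steps")
```

Nearly all of that saving comes from the single Filter after the last Expand(All); the remaining hits are in the expansion itself and in the Projection that reads recommendation.id.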