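I'm running into a problem that I suspect comes from my inability to write effective Cypher queries and my general lack of Neo4j experience.
Background
I have a relatively large dataset that seems to choke when finding mutual likes between a user and their second-degree friends.
Current stats: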
neo4j-sh (?)$ dbinfo -g "Primitive count"
{
"NumberOfNodeIdsInUse": 9343080,
"NumberOfPropertyIdsInUse": 25416540,
"NumberOfRelationshipIdsInUse": 47270718,
"NumberOfRelationshipTypeIdsInUse": 8
}
------
Numbers:
Users: ~ 2 million
Likes: ~ 7 million
Users Likes: ~ 22 million
Indexes:
neo4j-sh (?)$ schema
Indexes
ON :Employer(origin_id) ONLINE (for uniqueness constraint)
ON :Group(origin_id) ONLINE (for uniqueness constraint)
ON :Like(category) ONLINE
ON :Like(origin_id) ONLINE (for uniqueness constraint)
ON :Location(country_code) ONLINE
ON :Location(country) ONLINE
ON :Location(origin_id) ONLINE (for uniqueness constraint)
ON :School(origin_id) ONLINE (for uniqueness constraint)
ON :User(registered) ONLINE
ON :User(relationship_status) ONLINE
ON :User(interested_in) ONLINE
ON :User(gender) ONLINE
ON :User(age) ONLINE
ON :User(origin_id) ONLINE (for uniqueness constraint)
ON :User(uid) ONLINE (for uniqueness constraint)
Constraints
ON (user:User) ASSERT user.uid IS UNIQUE
ON (school:School) ASSERT school.origin_id IS UNIQUE
ON (user:User) ASSERT user.origin_id IS UNIQUE
ON (group:Group) ASSERT group.origin_id IS UNIQUE
ON (employer:Employer) ASSERT employer.origin_id IS UNIQUE
ON (like:Like) ASSERT like.origin_id IS UNIQUE
ON (location:Location) ASSERT location.origin_id IS UNIQUE
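Slow queries: http://pastebin.com/MPZ3aXCs
The problem:
For this user, the first query executes in about 12 seconds and returns 909 rows. That is still slow.
For the same user, the second query executes in about 70 seconds. The current problem, as I see it, is that searching for mutual likes through the matched friends of friends (line 33) causes a dramatic increase in execution time. I also noticed that adding this match seems to create a second EAGER "branch" in the profile. The CPU is completely pegged for the duration.
If I step back and simply match mutual likes between two users, it runs in < 50 ms.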
neo4j-sh (?)$ PROFILE MATCH (u:User {origin_id:2043})-[:LIKES]->(l:Like)<-[:LIKES]-(u2:User {origin_id:1212817}) return l;
3 rows
ColumnFilter
|
+Filter
|
+TraversalMatcher
+------------------+------+--------+-------------+--------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+------------------+------+--------+-------------+--------------------------------------+
| ColumnFilter | 3 | 0 | | keep columns l |
| Filter | 3 | 0 | | NOT( UNNAMED31 == UNNAMED50) |
| TraversalMatcher | 3 | 1114 | | u2, UNNAMED50, u2, UNNAMED31, u2 |
+------------------+------+--------+-------------+--------------------------------------+
Total database accesses: 1114
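We're currently looking to extend this query to match a user's third-degree friends, which right now seems impossible.
I should also note that I'm running this on a standalone AWS c3.xlarge (4 vCPUs / 8 GB RAM) that does nothing but host Neo4j. The server configuration is more or less the standard defaults. Happy to provide it if necessary.
Ideally I would like this information returned in a single query, since it gets processed afterwards.
Any help optimizing these queries is greatly appreciated. If I've missed any key information here, please let me know.
Edit: Using Neo4j 2.1.6
Edit 2:
I made some changes to the query that seem to have cut the number of db hits in half. The query time is now down to about 16 seconds.
The new query, with its profile, is available here: http://pastebin.com/UyFi89H7
Apart from using additional criteria to filter the friends of friends, are there any more optimizations I can make?
Answer 0 (score: 1)
First of all, thanks for asking such a detailed question.
Second, from looking at the beginning of your Cypher queries, my advice is to start from a small starting point: for example, match your user first and pass it to the next step with WITH. Then retrieve the user's location and pass both the user and the location along with WITH.
As you can see in the profile of your first query, it starts with a TraversalMatcher instead of benefiting from the label and property index.
A first optimization to get you on track: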
PROFILE
MATCH (user:User {origin_id:138})
WITH user
MATCH (user)-[r:LIVES_IN]->(userLoc:Location), (user)-[fr:FRIENDS_WITH*2]->(fof:User)
WHERE
user.origin_id <> fof.origin_id
AND NOT (user)-[:FRIENDS_WITH]->(fof)
With the query above, the index is used to retrieve your user instead of a TraversalMatcher.
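One way to check that the starting user really comes from the index is to profile the anchor lookup on its own. The following is a minimal sketch, assuming the `USING INDEX` hint is available in your Neo4j version and that the unique index on `:User(origin_id)` from the schema listing is the one to use; the predicate is written in a `WHERE` clause because that is the form the hint is documented with.

```cypher
// Profile only the anchor lookup; the hint asks the planner to use
// the :User(origin_id) index rather than scanning or traversing.
PROFILE
MATCH (user:User)
USING INDEX user:User(origin_id)
WHERE user.origin_id = 138
RETURN user
```

If the profile of this fragment shows an index lookup with only a handful of db hits, the remaining cost comes from the traversal steps that follow, not from finding the user.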
Answer 1 (score: 0)
You can also try:
MATCH (u:User {origin_id:2043}),(u2:User {origin_id:1212817})
MATCH path = allShortestPaths((u)-[:LIKES*..2]-(u2))
RETURN nodes(path)[1] as like
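Try to bring the cardinality down to a minimum as early as possible. Instead of matching each fof multiple times, aggregate to a single instance per fof first and then match: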
PROFILE
MATCH (user:User {origin_id:138})-[:LIVES_IN]->(userLoc:Location)-[:IN_COUNTRY]->(country)
MATCH (user)-[fr:FRIENDS_WITH]->(friend:User)-[fofr:FRIENDS_WITH]->(fof:User)
WHERE (fof.dob_age <= 35 AND fof.dob_age >= 20)
WITH user, count(distinct friend) as mutual_friend_count, collect(distinct friend) as mutual_friends, fof,
(ABS(user.dob_age - fof.dob_age)) as age_diff, userLoc, country
MATCH (fof)-[:LIVES_IN]->(fofLoc:Location)-[:IN_COUNTRY]->(country)
RETURN
fof.origin_id as fof_origin_id,
fof.first_name as fof_first_name,
fof.last_name as fof_last_name,
fof.dob_age as fof_age,
user.dob_age as user_age,
userLoc.latitude as user_loc_latitude,
userLoc.longitude as user_loc_longitude,
fofLoc.name as fof_loc_name,
fofLoc.latitude as fof_loc_latitude,
fofLoc.longitude as fof_loc_longitude,
age_diff as age_diff,
mutual_friend_count, mutual_friends
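The same aggregate-first idea can be carried over to the mutual likes from the question. This is only a sketch under assumptions: it reuses the `FRIENDS_WITH` and `LIKES` relationship types and the `origin_id` property from the question, leaves out the age and location filters for brevity, and the `LIMIT 50` is an arbitrary illustrative cap, not something from the original queries.

```cypher
MATCH (user:User {origin_id: 138})-[:FRIENDS_WITH]->(friend:User)-[:FRIENDS_WITH]->(fof:User)
WHERE fof <> user
  AND NOT (user)-[:FRIENDS_WITH]->(fof)
// Aggregate first: collapse all paths to the same fof into one row.
WITH user, fof, count(DISTINCT friend) AS mutual_friend_count
// Only now expand likes, once per distinct fof.
// OPTIONAL MATCH keeps fofs that share no likes with the user.
OPTIONAL MATCH (user)-[:LIKES]->(l:Like)<-[:LIKES]-(fof)
RETURN fof.origin_id AS fof_origin_id,
       mutual_friend_count,
       count(l) AS mutual_like_count,
       collect(l.origin_id) AS mutual_like_ids
ORDER BY mutual_like_count DESC
LIMIT 50  // hypothetical cap, adjust or remove as needed
```

Because the likes are matched after the `WITH`, each fof is expanded once rather than once per connecting friend, which is where the earlier queries multiplied their work.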