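I'm trying to find the number of users who have all of the skills required to qualify for an occupation. A user can have many skills, and for each occupation I want to return the count of all qualified users.
This is my current query: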
MATCH (:User)-[:has_skill]->(:Skill)<-[:requires]-(o:Occupation)
WITH DISTINCT o
MATCH (o)
WITH o, SIZE((o)-[:requires]->()) AS occupation_skill_count
MATCH (o)-[:requires]->(:Skill)<-[hs:has_skill]-(u:User)
WITH o, u, occupation_skill_count, count(hs) AS user_skill_count
WHERE occupation_skill_count = user_skill_count
WITH o.title as occupation_title, count(u) as users_count
RETURN occupation_title, users_count
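However, I'm worried that my query is inefficient, because it times out (there are more than 60,000 occupations, 10,000 users, and 2,500 skills). I'd like to know whether there is a better way to write it.
The approach I took in writing this query seems to work in the staging environment, where there are far fewer records. With the full data set, however, it just times out. Is there a better way to write this?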
Answer 0 (score: 0)
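For performance questions, it helps to show the PROFILE plan of the query. If you can expand all elements of the plan and paste it into your question, that helps pinpoint where the query can be improved.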
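As a minimal sketch of how such a plan could be captured, the question's query can be prefixed with PROFILE, which runs the statement and reports rows and db hits per operator. The `LIMIT 100` is an assumption added here so the profiled run finishes on the large data set; it is not part of the original query. The sketch also assumes a Neo4j version that still accepts `SIZE()` on a pattern, as the question's query does.

```cypher
// PROFILE executes the query and returns the plan with rows and db hits per operator.
// LIMIT 100 is an illustrative assumption: profile a small slice of occupations first.
PROFILE
MATCH (o:Occupation)
WITH o LIMIT 100
WITH o, SIZE((o)-[:requires]->()) AS occupation_skill_count
MATCH (o)-[:requires]->(:Skill)<-[hs:has_skill]-(u:User)
WITH o, u, occupation_skill_count, count(hs) AS user_skill_count
WHERE occupation_skill_count = user_skill_count
RETURN o.title AS occupation_title, count(u) AS users_count
```

If even the limited run is too slow, EXPLAIN can be used in place of PROFILE to get the planned operators without executing the query, at the cost of losing the actual row and db-hit numbers.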
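Since you are doing this for all occupations, it's an ideal candidate for batching. However, since a batch run can't return the counts (it is meant for write operations), we can use it to write each count to its :Occupation node, so the numbers can be queried quickly once they have been computed. At that point it's up to you whether to keep the computed properties (perhaps along with a timestamp of when they were computed) or to just report them and remove the properties right away.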
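You'll need APOC Procedures to perform the batch operation. apoc.periodic.iterate() will be the procedure of choice (you can adjust batchSize to whatever works best for you). I'll add comments inline.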
CALL apoc.periodic.iterate(
// iterate in batches for all :Occupations
"MATCH (o:Occupation) RETURN o",
// for each occupation, get all skills in ascending order of skilled users
"MATCH (o)-[:requires]->(s:Skill)
// skills nobody has are kept: they sort first, so the match below finds no users
WITH o, s, size((s)<-[:has_skill]-()) as skilledUserCount
ORDER BY skilledUserCount ASC
WITH o, collect(s) as skills
WITH o, head(skills) as first, tail(skills) as skills
// get users with all the required skills
// because of ordering, we start with the smallest set of skilled users
MATCH (first)<-[:has_skill]-(u)
WHERE ALL(skill in skills WHERE (skill)<-[:has_skill]-(u))
// now set this count of users with all skills to the occupation
WITH o, count(u) as skilledUsers
SET o.skilledUsers = skilledUsers
// uncomment next line to keep a timestamp of when this was last updated
// SET o.skilledUsersUpdated = timestamp()
",
{batchSize:1000, parallel:true, iterateList:true}) YIELD batches, total
RETURN batches, total
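To sanity-check the logic on a small graph before running the batch against the full data set, a toy fixture along these lines could be loaded into an empty test database. All names and titles below are made up for illustration; only the labels and relationship types come from the question.

```cypher
// Hypothetical fixture: one occupation requiring two skills, and two users.
// Alice has both required skills; Bob has only one of them.
CREATE (cypher:Skill {name: 'Cypher'}),
       (java:Skill {name: 'Java'}),
       (dev:Occupation {title: 'Graph Developer'}),
       (alice:User {name: 'Alice'}),
       (bob:User {name: 'Bob'}),
       (dev)-[:requires]->(cypher),
       (dev)-[:requires]->(java),
       (alice)-[:has_skill]->(cypher),
       (alice)-[:has_skill]->(java),
       (bob)-[:has_skill]->(cypher)
```

On this fixture the skill with the fewest users is Java (Alice only), so the match starts from Alice and then checks that she also has Cypher. Bob is never a candidate, and the occupation should end up with skilledUsers = 1.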
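Once this finishes, the occupations should carry their skilled-user counts for easy querying. An occupation for which the batch query yields no matching users gets no row, so its skilledUsers property is left unset: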
MATCH (o:Occupation)
RETURN o.title as occupation_title, coalesce(o.skilledUsers, 0) as users_count
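If the counts are only needed for a one-off report, the cached property can be dropped afterwards, as mentioned earlier. A minimal sketch, assuming the property name skilledUsers used in the batch query:

```cypher
// Remove the cached counts once they have been reported.
// Only touches occupations that actually carry the property.
MATCH (o:Occupation)
WHERE o.skilledUsers IS NOT NULL
REMOVE o.skilledUsers
```

If the optional skilledUsersUpdated timestamp was enabled in the batch query, it would need to be removed in the same way.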