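I have two kinds of nodes (Job and JobSeeker) to load from CSV files. I also have an entity called Interest that defines the relationship between the two. Each Interest row holds the IDs of a Job and a JobSeeker, which I use to create the SHOWN_INTEREST relationship. The total counts of the entities are listed below: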
Job - 4k
JobSeeker - 80k
Interest - 4.4 Million
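I am trying to write a collaborative filtering query (the top 5 jobs that similar JobSeekers have shown interest in), but it takes a very long time to execute. Can anyone help me find out what is wrong with this query? All the queries are listed below.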
CREATE INDEX ON :Job(_id)
CREATE INDEX ON :JobSeeker(_id)
// Job Node
LOAD CSV WITH HEADERS FROM "job.csv" AS j1
MERGE (job:Job {title : "Job",_id:j1._id, jobTitle:j1.title , company: j1.companyName })
// JobSeeker Node
LOAD CSV WITH HEADERS FROM "jobSeeker.csv" AS row
CREATE (p:JobSeeker { name: "JobSeeker", _id: row._id, email: row.email, role: row.role, rating: row.rating, firstName: row.firstName, sonicScore: row.coreOfJobSeeker,
hiredCount: row.numberOfTimesHired })
// Interest Relationship between Job and JobSeeker
LOAD CSV WITH HEADERS FROM "interest.csv" AS row WITH row LIMIT 10000
MATCH (jobSeeker:JobSeeker),(job:Job)
WHERE jobSeeker._id = row.jobSeekerId AND job._id = row.jobId
CREATE (jobSeeker)-[r:SHOWN_INTEREST]->(job)
RETURN type(r)
// Collaborative Filtering
// Top 5 Jobs shown interest by similar JobSeekers
MATCH (s1:JobSeeker{_id:"579c914fe4b00d9fa5d60fb0"})-[:SHOWN_INTEREST]->(j1:Job)<-[:SHOWN_INTEREST]-(s2:JobSeeker),
(s2:JobSeeker)-[:SHOWN_INTEREST]->(j2:Job)
WHERE NOT (s1)-[:SHOWN_INTEREST]->(j2)
WITH j2 , count(distinct j2) as frequency
ORDER BY frequency DESC LIMIT 5
RETURN j2.jobTitle , frequency
Answer 0 (score: 0)
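It may be easier to first collect the jobs that s1 is interested in. That way we can use collection filtering, rather than an extra pattern expansion, to weed out those jobs at the end: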
MATCH (s1:JobSeeker{_id:"579c914fe4b00d9fa5d60fb0"})-[:SHOWN_INTEREST]->(j1)
WITH s1, collect(j1) as interestedJobs
MATCH (s1)-[:SHOWN_INTEREST*3]-(j2:Job)
WHERE NOT j2 in interestedJobs
WITH j2 , count(j2) as frequency
ORDER BY frequency DESC
LIMIT 5
RETURN j2.jobTitle , frequency
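As a variation, the three hops can be spelled out explicitly and the frequency defined as the number of distinct similar job seekers per candidate job. This is only a sketch, assuming the labels, property names, and the `_id` value from the question; it is not a tuned or benchmarked query. It also sidesteps an issue in the question's query, where `count(distinct j2)` grouped by `j2` can only ever return 1.

```cypher
// Sketch only: labels, properties and the _id value are taken from the question.
MATCH (s1:JobSeeker {_id: "579c914fe4b00d9fa5d60fb0"})-[:SHOWN_INTEREST]->(j1:Job)
// Collect the jobs s1 already showed interest in, so they can be excluded later
WITH s1, collect(j1) AS interestedJobs
UNWIND interestedJobs AS j1
// Similar job seekers (s2) share at least one job with s1; j2 are their other jobs
MATCH (j1)<-[:SHOWN_INTEREST]-(s2:JobSeeker)-[:SHOWN_INTEREST]->(j2:Job)
WHERE s2 <> s1 AND NOT j2 IN interestedJobs
// Count each similar job seeker once per candidate job
WITH j2, count(DISTINCT s2) AS frequency
ORDER BY frequency DESC
LIMIT 5
RETURN j2.jobTitle, frequency
```

Prefixing either query with `PROFILE` shows the execution plan and the database hits per operator, which should help confirm that the index on `:JobSeeker(_id)` is used for the initial lookup and show where the time is actually spent.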