Optimizing a query that computes the Jaccard similarity between two nodes in a large data set

Time: 2015-03-20 08:21:45

Tags: neo4j cypher

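I have billions of nodes labeled Profile (:Profile {member_id,name,gender}). I need to compute the Jaccard index between pairs of them, create a SIMILARITY relationship for each pair, and store the index on it as a property. Male Profile nodes have CONTACTED relationships to female Profile nodes, and vice versa.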
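For reference, the Jaccard index of two contact sets is the size of their intersection divided by the size of their union. The following is a minimal sketch in plain Java, not tied to Neo4j; the class name `JaccardExample` is hypothetical, and the way the sample member ids are split between the two sets is invented for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class; not part of Neo4j or of the original post.
final class JaccardExample {
    // Jaccard index = |intersection| / |union|.
    // Returns 0.0 when both sets are empty (a convention chosen for this sketch).
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        // |union| = |a| + |b| - |intersection|, so the union set is never built.
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        // Ids borrowed from the sample CSV in this question; the grouping is made up.
        Set<String> contactedByU1 =
                new HashSet<>(Arrays.asList("angella", "AnshuSharma", "GSH26933499"));
        Set<String> contactedByU2 =
                new HashSet<>(Arrays.asList("AnshuSharma", "GSH26933499", "098w"));
        System.out.println(jaccard(contactedByU1, contactedByU2)); // prints 0.5
    }
}
```

Here the two sets share 2 members out of 4 distinct ones, which gives 2 / 4 = 0.5.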

Here is the CQL:

Index on gender.

MATCH (u1:Profile {gender:"Male"}), (u2:Profile {gender:"Male"}) WHERE u1 <> u2
MATCH (u1)-[:CONTACTED]->(u3:Profile {gender:"Female"})<-[:CONTACTED]-(u2) WITH u1, u2, count(u3.member) as intersect
MATCH (u1)-[:CONTACTED]->(u1_f:Profile {gender:"Female"}) WITH u1, u2, intersect, collect(DISTINCT u1_f.member) AS coll1
MATCH (u2)-[:CONTACTED]->(u2_f:Profile {gender:"Female"}) WITH u1, u2, collect(DISTINCT u2_f.member) AS coll2, coll1, intersect
WITH u1, u2, intersect, coll1, coll2, length(coll1 + filter(x IN coll2 WHERE NOT x IN coll1)) as union
WHERE (1.0*intersect/union) > 0
CREATE UNIQUE (u1)-[:SIMILARITY {score: (1.0*intersect/union)}]-(u2);

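If I run this with a limit of 5, it takes about 5 minutes to return results, which is completely infeasible. What can I do to speed up the execution time, given that this is an important part of my project?

I thought the following would work, but it made things worse.

Constraint created on member.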

LOAD CSV WITH HEADERS FROM "file:{path_to_csv}/member_to_member.csv" AS row
MATCH (u1:Profile {member: row.sentby}), (u2:Profile {gender:"Male"}) WHERE u1 <> u2 AND row.status = "Contacted" AND row.sentbygender = "Male"
MATCH (u1)-[:CONTACTED]->(u3:Profile {member: row.recdby})<-[:CONTACTED]-(u2) WITH row, u1, u2, count(u3.member) as intersect
//WHERE intersect>0
MATCH (u1)-[:CONTACTED]->(u1_f:Profile {member: row.recdby}) WITH row, u1, u2, intersect, collect(DISTINCT u1_f.member) AS coll1
MATCH (u2)-[:CONTACTED]->(u2_f:Profile {member: row.recdby}) WITH row, u1, u2, collect(DISTINCT u2_f.member) AS coll2, coll1, intersect
WITH u1, u2, intersect, coll1, coll2, length(coll1 + filter(x IN coll2 WHERE NOT x IN coll1)) as union
RETURN u1.member, u2.member, (1.0*intersect/union) AS score LIMIT 5;

member_to_member.csv

sentby,sentbygender,recdby,recdbygender,date_of_contact,status
OSH34878034,Male,angella,Female,2013-11-12,Contacted
OSH34878034,Male,AnshuSharma,Female,2013-11-12,Contacted
OSH34878034,Male,GSH26933499,Female,2013-11-12,Contacted
OSH34878034,Male,4SH00112696,Female,2013-11-12,Contacted
OSH34878034,Male,0308heinz,Female,2013-11-12,Contacted
OSH34878034,Male,8SH93301323,Female,2013-11-12,Contacted
OSH34878034,Male,098w,Female,2013-11-12,Contacted

Source: http://www.lyonwj.com/twizzard-a-tweet-recommender-system-using-neo4j/

Note: the queries above only find male -> male similarity.

Thanks

2 answers:

Answer 0: (score: 3)

Use Java instead of Cypher.

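Your first line alone already creates about a billion squared rows, because it matches every male profile against every other male profile.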
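One way to avoid materializing every pair is to start from the receivers and only count pairs of senders that actually share a contact. The sketch below is a swapped-in illustration in plain Java, not necessarily what this answer has in mind, and it does not use the Neo4j API. The class name `SharedContactPairs`, the `"a|b"` key format, and the input layout are all assumptions; it also assumes each (sender, receiver) relationship appears once and that ids contain no `|` character.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and data layout are assumptions, not from the post.
final class SharedContactPairs {
    // contacts[i] = {senderId, receiverId}, one entry per distinct CONTACTED relationship.
    // Returns "sender1|sender2" -> Jaccard score, only for pairs sharing a receiver.
    static Map<String, Double> scores(String[][] contacts) {
        Map<String, List<String>> sendersByReceiver = new HashMap<>();
        Map<String, Integer> contactCount = new HashMap<>();
        for (String[] c : contacts) {
            sendersByReceiver.computeIfAbsent(c[1], k -> new ArrayList<>()).add(c[0]);
            contactCount.merge(c[0], 1, Integer::sum);
        }

        // For each receiver, every pair of her senders shares one more contact.
        Map<String, Integer> shared = new HashMap<>();
        for (List<String> senders : sendersByReceiver.values()) {
            for (int i = 0; i < senders.size(); i++) {
                for (int j = i + 1; j < senders.size(); j++) {
                    String a = senders.get(i);
                    String b = senders.get(j);
                    // Order the pair so (a, b) and (b, a) use the same key.
                    String key = a.compareTo(b) < 0 ? a + "|" + b : b + "|" + a;
                    shared.merge(key, 1, Integer::sum);
                }
            }
        }

        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : shared.entrySet()) {
            String[] pair = e.getKey().split("\\|");
            // |union| = |contacts of a| + |contacts of b| - |shared contacts|
            int union = contactCount.get(pair[0]) + contactCount.get(pair[1]) - e.getValue();
            result.put(e.getKey(), (double) e.getValue() / union);
        }
        return result;
    }
}
```

Pairs with no shared contact never appear, which matches the `> 0` filter in the question's query. The inner loops are still quadratic in the number of senders per receiver, so very popular receivers remain expensive.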

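I would probably create one big long[] array covering all profiles,

fold the 3 counters into a single long (as a bitmask),

then go over all the CONTACTED relationships of the females (I would also promote gender to a label),

and for each rel, find the matching end-node entry via its node-id index and increment one of the 3 counters.

That should give you an array with the raw numbers, from which you can compute the results.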
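The idea of folding three counters into one long per node can be sketched as follows. This is a minimal illustration under assumptions: the answer does not say what the three counters hold, so they are simply numbered 0 to 2, and the class name `PackedCounters` and the 21-bit width per counter are choices made for this sketch.

```java
// Hypothetical sketch: three 21-bit counters packed into one long per node.
// What each counter represents is not specified in the answer.
final class PackedCounters {
    private static final int BITS = 21;                // 3 * 21 = 63 bits, fits in a long
    private static final long MASK = (1L << BITS) - 1; // largest value a counter can hold

    private final long[] slots;                        // one long per node, indexed by node id

    PackedCounters(int nodeCount) {
        slots = new long[nodeCount];
    }

    // Adds 1 to counter `which` (0, 1 or 2) of the given node.
    void increment(int nodeId, int which) {
        if (get(nodeId, which) == MASK) {
            throw new ArithmeticException("counter " + which + " would overflow");
        }
        slots[nodeId] += 1L << (which * BITS);
    }

    // Reads counter `which` (0, 1 or 2) of the given node.
    long get(int nodeId, int which) {
        if (which < 0 || which > 2) {
            throw new IllegalArgumentException("which must be 0, 1 or 2");
        }
        return (slots[nodeId] >>> (which * BITS)) & MASK;
    }
}
```

A single Java array is limited to roughly 2^31 - 1 elements, so with billions of nodes the slots would likely have to be split across several arrays or kept off-heap.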

Answer 1: (score: 0)

The query above performs well in Neo4j 2.2.0.