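I am new to Neo4j and am trying to build a dating site as a POC. I have a 4 GB input file in the format shown below.
Each row contains a viewerId (male/female) and viewedId, the list of IDs that viewer has looked at. Based on this history file, I need to give recommendations whenever a user comes online.
Input file: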
viewerId viewedId
12345 123456,23456,987653
23456 23456,123456,234567
34567 234567,765678,987653
:
For this task, I tried the following:
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:/home/hadoopuser/Neo-input " AS row
FIELDTERMINATOR '\t'
WITH row, split(row.viewedId, ",") AS viewedIds
UNWIND viewedIds AS viewedId
MERGE (p2:Persons2 {viewerId: row.viewerId})
MERGE (c2:Companies2 {viewedId: viewedId})
MERGE (p2)-[:Friends]->(c2)
MERGE (c2)-[:Sees]->(p2);
The Cypher query I use to get the results is:
MATCH (p2:Persons2)-[r*1..3]->(c2: Companies2)
RETURN p2,r, COLLECT(DISTINCT c2) as friends
This task takes 3 days to complete.
My system configuration:
Ubuntu -14.04
RAM -24GB
Neo4j configuration:
neo4j.properties:
neostore.nodestore.db.mapped_memory=200M
neostore.propertystore.db.mapped_memory=2300M
neostore.propertystore.db.arrays.mapped_memory=5M
neostore.propertystore.db.strings.mapped_memory=3200M
neostore.relationshipstore.db.mapped_memory=800M
neo4j-wrapper.conf:
wrapper.java.initmemory=12000
wrapper.java.maxmemory=12000
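To cut the time, I searched the internet and found the idea of a batch importer at the following link: https://github.com/jexp/batch-import
In that project, node.csv and rels.csv files are imported into Neo4j. I don't know how those node.csv and rels.csv files are created or which scripts are used to produce them.
Can anyone give me a sample script to produce node.csv and rels.csv files for my data?
Or can you suggest anything else that would make importing and retrieving the data faster?
Thanks in advance.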
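Answer 0 (score: 1)
You don't need the reverse relationship; one direction is enough!
For the import, configure the heap (neo4j-wrapper.conf) to 12G and the page cache (neo4j.properties) to 10G.
Try this; it should finish within a few minutes.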
create constraint on (p:Persons2) assert p.viewerId is unique;
create constraint on (p:Companies2) assert p.viewedId is unique;
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:/home/hadoopuser/Neo-input " AS row
FIELDTERMINATOR '\t'
MERGE (p2:Persons2 {viewerId: row.viewerId});
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:/home/hadoopuser/Neo-input " AS row
FIELDTERMINATOR '\t'
FOREACH (viewedId IN split(row.viewedId, ",") |
MERGE (c2:Companies2 {viewedId: viewedId}));
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:/home/hadoopuser/Neo-input " AS row
FIELDTERMINATOR '\t'
WITH row, split(row.viewedId, ",") AS viewedIds
MATCH (p2:Persons2 {viewerId: row.viewerId})
UNWIND viewedIds AS viewedId
MATCH (c2:Companies2 {viewedId: viewedId})
MERGE (p2)-[:Friends]->(c2);
For the relationship MERGE, if some companies have hundreds of thousands to millions of views, you may want to use this instead:
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:/home/hadoopuser/Neo-input " AS row
FIELDTERMINATOR '\t'
WITH row, split(row.viewedId, ",") AS viewedIds
MATCH (p2:Persons2 {viewerId: row.viewerId})
UNWIND viewedIds AS viewedId
MATCH (c2:Companies2 {viewedId: viewedId})
WHERE shortestPath((p2)-[:Friends]->(c2)) IS NULL
CREATE (p2)-[:Friends]->(c2);
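What are you trying to achieve by retrieving the cross product between all persons and all companies up to 3 levels deep? That could be trillions of paths.
Usually you want to know about a single person or company.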
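E.g. for 123456, the people who viewed this company are 12345 and 23456. Their view lists are 12345: 123456,23456,987653 and 23456: 23456,123456,234567. So for company 123456 I need to recommend 23456,987653,23456,234567, and the distinct final result is 23456,987653,234567.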
match (c:Companies2)<-[:Friends]-(p1:Persons2)-[:Friends]->(c2:Companies2)
where c.viewedId = "123456" // LOAD CSV stores values as strings, so compare against a string
return distinct c2.viewedId;
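The co-view logic behind this query can be checked against the three sample rows from the question without a database. The following is only a sketch in plain Python; the names `parse` and `co_viewed` are made up for illustration and are not part of Neo4j.

```python
# Sample rows from the question: viewerId <TAB> comma-separated viewedIds.
SAMPLE = (
    "12345\t123456,23456,987653\n"
    "23456\t23456,123456,234567\n"
    "34567\t234567,765678,987653"
)


def parse(text):
    """Map each viewerId to the set of IDs that viewer has looked at."""
    views = {}
    for line in text.splitlines():
        viewer, viewed = line.split("\t")
        views[viewer] = set(viewed.split(","))
    return views


def co_viewed(views, target):
    """Return everything seen by viewers of `target`, excluding `target` itself."""
    result = set()
    for seen in views.values():
        if target in seen:
            result |= seen
    result.discard(target)
    return result


if __name__ == "__main__":
    print(sorted(co_viewed(parse(SAMPLE), "123456")))
```

For 123456 this yields the same three IDs listed above as the distinct final result (23456, 987653, 234567).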
For all companies, this may help:
match (c:Companies2)<-[:Friends]-(p1:Persons2)
with p1, collect(c) as companies
match (p1)-[:Friends]->(c2:Companies2)
return c2.viewedId, extract(c in companies | c.viewedId);
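On the batch-import route asked about in the question: the history file can be split into a node file and a relationship file with a short script. The sketch below is an assumption-laden starting point, not the importer's documented format. The function name `write_import_files` and the column headers (`id`, `label`, `viewerId`, `viewedId`, `type`) are placeholders, so check the batch-import README for the header syntax it actually expects. It reuses the labels and relationship type from the Cypher above.

```python
import csv


def write_import_files(input_path, nodes_path, rels_path):
    """Split the tab-separated history file into node and relationship files.

    Column headers written here are placeholders (assumption), not the
    header syntax required by any particular import tool.
    """
    seen = set()  # (label, id) pairs already written to the node file
    with open(input_path, newline="") as src, \
            open(nodes_path, "w", newline="") as nodes_file, \
            open(rels_path, "w", newline="") as rels_file:
        reader = csv.reader(src, delimiter="\t")
        nodes = csv.writer(nodes_file, delimiter="\t", lineterminator="\n")
        rels = csv.writer(rels_file, delimiter="\t", lineterminator="\n")
        nodes.writerow(["id", "label"])
        rels.writerow(["viewerId", "viewedId", "type"])
        next(reader, None)  # skip the "viewerId viewedId" header line
        for row in reader:
            if len(row) < 2:
                continue  # ignore blank or malformed lines
            viewer = row[0].strip()
            if ("Persons2", viewer) not in seen:
                seen.add(("Persons2", viewer))
                nodes.writerow([viewer, "Persons2"])
            # dict.fromkeys drops repeated IDs within one row, keeping order
            ids = (v.strip() for v in row[1].split(","))
            for viewed in dict.fromkeys(v for v in ids if v):
                if ("Companies2", viewed) not in seen:
                    seen.add(("Companies2", viewed))
                    nodes.writerow([viewed, "Companies2"])
                rels.writerow([viewer, viewed, "Friends"])


# Example call (paths are illustrative):
# write_import_files("Neo-input", "node.csv", "rels.csv")
```

Nodes are de-duplicated across the whole file, which keeps one set of IDs in memory. Relationships are only de-duplicated within a row, so if the same viewer can appear on several rows, repeated pairs would still need to be removed afterwards.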