Optimizing a Cypher query for movie recommendation on a large dataset

Time: 2018-10-30 13:32:02

Tags: performance neo4j cypher

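After reading https://markorodriguez.com/2011/09/22/a-graph-based-movie-recommender-engine/, I am now working on movie recommendation with the MovieLens 20M dataset. Movie nodes are linked to Genre nodes through the hasGenre relationship and to User nodes through the hasRating relationship. Given a query movie (for example, Toy Story), I am trying to retrieve the movies that have the most co-ratings with it (counting only ratings > 3.0) and that share all of Toy Story's genres. Here is my Cypher query: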

MATCH (inputMovie:Movie {movieId: 1})-[r:hasGenre]-(h:Genre)
WITH inputMovie, COLLECT (h) as inputGenres
MATCH (inputMovie)<-[r:hasRating]-(User)-[o:hasRating]->(movie)-[:hasGenre]->(genre) 
WITH  inputGenres,  r, o, movie, COLLECT(genre) AS genres 
WHERE ALL(h in inputGenres where h in genres) and (r.rating>3 and o.rating>3)  
RETURN movie.title,movie.movieId, count(*) 
ORDER BY count(*) DESC

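However, my system does not seem able to handle it (16GB of RAM, a 4th-gen Core i7, and an SSD). When I run the query, RAM usage climbs to 97% and then Neo4j shuts down unexpectedly (possibly because of the heap size, or because of the amount of RAM).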

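  1. Did I write the query correctly? I am new to Neo4j, so I may have gotten it wrong.
  2. How would you suggest optimizing this kind of query?
  3. How can I tune Neo4j so that it can handle queries like this on a large dataset with my system specs?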

Thanks.

1 answer:

Answer 0: (score: 0)

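First, you can simplify the Cypher so it can be planned more efficiently: match only what you need and handle the rest in the WHERE clause (that way, filtering can happen while matching).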

MATCH (inputMovie:Movie {movieId: 1})-[r:hasGenre]->(h:Genre)
WITH inputMovie, COLLECT (h) as inputGenres
MATCH (inputMovie)<-[r:hasRating]-(user)-[o:hasRating]->(movie)
WHERE (r.rating>3 and o.rating>3) AND ALL(genre in inputGenres WHERE (movie)-[:hasGenre]->(genre))
RETURN movie.title,movie.movieId, count(*) 
ORDER BY count(*) DESC
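
A possible further variation on the query above, offered only as an untested sketch: aggregate the co-ratings per movie first, then run the genre check once per candidate movie instead of once per rating pair. The names coRatings and the LIMIT of 20 are assumptions added for illustration and do not come from the original question.

```cypher
// Sketch only: not benchmarked against the MovieLens 20M graph.
// Step 1: count qualifying co-ratings per candidate movie.
MATCH (inputMovie:Movie {movieId: 1})<-[r:hasRating]-(user)-[o:hasRating]->(movie)
WHERE r.rating > 3 AND o.rating > 3
WITH inputMovie, movie, count(*) AS coRatings
// Step 2: collect the input movie's genres (one list per candidate row).
MATCH (inputMovie)-[:hasGenre]->(h:Genre)
WITH movie, coRatings, collect(h) AS inputGenres
// Step 3: keep only candidates that have every input genre.
WHERE ALL(genre IN inputGenres WHERE (movie)-[:hasGenre]->(genre))
RETURN movie.title, movie.movieId, coRatings
ORDER BY coRatings DESC
LIMIT 20  // assumed cutoff; drop it to return every match
```

Prefixing either version with PROFILE shows the executed plan and the rows flowing through each operator, which may help confirm whether the reordering actually reduces work on your data.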

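Now, if you don't mind adding data to the graph to make the lookup easier, another option is to split the query into several small parts and "cache" the results. For example: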

// Cypher 1
MATCH (inputMovie:Movie {movieId: 1})-[r:hasGenre]->(h:Genre)
WITH inputMovie, COLLECT (h) as inputGenres
MATCH (movie:Movie)
WHERE ALL(genre in inputGenres WHERE (movie)-[:hasGenre]->(genre))
// Merge so that multiple runs don't create extra copies
MERGE (inputMovie)-[:isLike]->(movie)

// Cypher 2
MATCH (movie:Movie)<-[r:hasRating]-(user)
WHERE r.rating>3
// Merge so that multiple runs don't create extra copies
MERGE (user)-[:reallyLikes]->(movie)

// Cypher 3
MATCH (inputMovie:Movie{movieId: 1})<-[:reallyLikes]-(user)-[:reallyLikes]->(movie:Movie)<-[:isLike]-(inputMovie)
RETURN movie.title,movie.movieId, count(*) 
ORDER BY count(*) DESC
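
On a dataset with around 20 million ratings, running Cypher 2 as a single transaction may hit the same memory limits described in the question. One way to work around that, assuming the APOC plugin is installed (the question does not say whether it is), is to batch the writes with apoc.periodic.iterate. The batch size of 10000 is an illustrative value, not a tuned recommendation.

```cypher
// Sketch only: requires the APOC plugin (an assumption, not stated in the question).
// The first statement streams the (user, movie) pairs to process;
// the second runs once per pair, committed in batches.
CALL apoc.periodic.iterate(
  "MATCH (user)-[r:hasRating]->(movie:Movie) WHERE r.rating > 3 RETURN user, movie",
  "MERGE (user)-[:reallyLikes]->(movie)",
  {batchSize: 10000, parallel: false}
)
YIELD batches, total
RETURN batches, total
```

Keeping parallel set to false avoids lock contention when several batches try to attach relationships to the same popular movie node.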