Question

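After reading https://markorodriguez.com/2011/09/22/a-graph-based-movie-recommender-engine/, I am working on movie recommendation with the MovieLens 20m dataset. Movie nodes are connected to Genre nodes by the hasGenre relationship, and User nodes are connected to Movie nodes by the hasRating relationship. For a given input movie (e.g. Toy Story), I am trying to retrieve all movies that share every genre with Toy Story, ranked by how often they are highly co-rated with it (both ratings > 3.0). Here is my Cypher query: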

MATCH (inputMovie:Movie {movieId: 1})-[r:hasGenre]-(h:Genre)
WITH inputMovie, COLLECT (h) as inputGenres
MATCH (inputMovie)<-[r:hasRating]-(User)-[o:hasRating]->(movie)-[:hasGenre]->(genre) 
WITH  inputGenres,  r, o, movie, COLLECT(genre) AS genres 
WHERE ALL(h in inputGenres where h in genres) and (r.rating>3 and o.rating>3)  
RETURN movie.title,movie.movieId, count(*) 
ORDER BY count(*) DESC

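However, my system does not seem able to handle it (16 GB of RAM, 4th-gen Core i7, SSD). When I run the query, RAM usage climbs to 97% and then Neo4j shuts down unexpectedly (possibly because of the heap size, or because of the amount of RAM).

Is my query correct? I am new to Neo4j, so I may have written it incorrectly.
How can I optimize a query like this?
How can I tune Neo4j so that it can handle a large dataset for this kind of query on my system specs?

Thanks.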

Answer 1

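First, you can simplify the Cypher so that it plans more efficiently: MATCH only what you need and handle the rest in the WHERE clause (that way, the filtering can happen while matching).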

// Collect the genres of the input movie once
MATCH (inputMovie:Movie {movieId: 1})-[:hasGenre]->(h:Genre)
WITH inputMovie, COLLECT(h) AS inputGenres
// Find movies co-rated by the same user
MATCH (inputMovie)<-[r:hasRating]-(user)-[o:hasRating]->(movie)
// Keep only high ratings, and only movies that have every input genre
WHERE (r.rating > 3 AND o.rating > 3) AND ALL(genre IN inputGenres WHERE (movie)-[:hasGenre]->(genre))
RETURN movie.title, movie.movieId, count(*)
ORDER BY count(*) DESC
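
Before running the rewritten query on the full dataset, it may help to confirm that the starting lookup is index-backed and to inspect the plan without executing it. The following is a minimal sketch, not part of the original answer: the index name movie_id_index is hypothetical, and the CREATE INDEX syntax shown assumes Neo4j 4.x or later (on 3.x the older form CREATE INDEX ON :Movie(movieId) would be used instead).

```cypher
// Assumption: Neo4j 4.x+ syntax; "movie_id_index" is an illustrative name.
// An index on :Movie(movieId) lets the planner find the input movie directly
// instead of scanning every Movie node.
CREATE INDEX movie_id_index IF NOT EXISTS FOR (m:Movie) ON (m.movieId);

// EXPLAIN shows the query plan without running the query,
// so it is safe to use even when the query itself exhausts memory.
EXPLAIN
MATCH (inputMovie:Movie {movieId: 1})-[:hasGenre]->(h:Genre)
WITH inputMovie, COLLECT(h) AS inputGenres
MATCH (inputMovie)<-[r:hasRating]-(user)-[o:hasRating]->(movie)
WHERE (r.rating > 3 AND o.rating > 3) AND ALL(genre IN inputGenres WHERE (movie)-[:hasGenre]->(genre))
RETURN movie.title, movie.movieId, count(*)
ORDER BY count(*) DESC;
```

If the plan starts with a label scan rather than an index seek on Movie, the index is probably missing or not yet online.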

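Now, if you don't mind adding data to the graph to make the lookup easier, another thing you can do is split the query into a few smaller parts and "cache" the results. For example: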

// Cypher 1
MATCH (inputMovie:Movie {movieId: 1})-[:hasGenre]->(h:Genre)
WITH inputMovie, COLLECT(h) AS inputGenres
MATCH (movie:Movie)
WHERE ALL(genre IN inputGenres WHERE (movie)-[:hasGenre]->(genre))
// Merge so that multiple runs don't create extra copies
MERGE (inputMovie)-[:isLike]->(movie)

// Cypher 2
MATCH (movie:Movie)<-[r:hasRating]-(user)
WHERE r.rating > 3
// Merge so that multiple runs don't create extra copies
MERGE (user)-[:reallyLikes]->(movie)

// Cypher 3
MATCH (inputMovie:Movie {movieId: 1})<-[:reallyLikes]-(user)-[:reallyLikes]->(movie:Movie)<-[:isLike]-(inputMovie)
RETURN movie.title, movie.movieId, count(*)
ORDER BY count(*) DESC
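
Note that Cypher 2 touches every rating in the graph, so on a dataset of this size running it as a single transaction may itself put pressure on the heap. One possible way around that, sketched here under the assumption that the APOC plugin is installed (the original answer does not mention it), is to commit the MERGE in batches with apoc.periodic.iterate. The batch size of 10000 is an illustrative value, not a tuned one.

```cypher
// Assumption: the APOC library is available on the server.
// The first statement streams the (user, movie) pairs with a rating above 3;
// the second is applied to them in batches, each in its own transaction.
CALL apoc.periodic.iterate(
  "MATCH (user)-[r:hasRating]->(movie:Movie) WHERE r.rating > 3 RETURN user, movie",
  "MERGE (user)-[:reallyLikes]->(movie)",
  {batchSize: 10000, parallel: false}  // parallel: false avoids lock contention on shared nodes
)
YIELD batches, total, errorMessages
RETURN batches, total, errorMessages;
```

Because the inner statement still uses MERGE, rerunning the procedure should not create duplicate reallyLikes relationships.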

Optimizing a Cypher query for movie recommendation on a large dataset
