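I am a server engineer at a company that provides a dating service. I am currently building a PoC for our new recommendation engine. I tried neo4j, but its performance does not meet our needs. I have a strong feeling that I am doing something wrong and that neo4j can do much better. Can someone advise me on how to improve the performance of my Cypher query, or how to tune neo4j properly?
I am using neo4j-enterprise-2.3.1, running on a c4.4xlarge instance with Amazon Linux.
In our dataset each user can have 4 types of relationships with other users: LIKE, DISLIKE, BLOCK and MATCH. Each user also has properties such as countryCode, birthday and gender.
I imported all users and relationships from our RDBMS into neo4j using the neo4j-import tool. So each user is a node with properties, and each reference between users is a relationship.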
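The neo4j-import tool reported that
2 558 667 nodes,
1 674 714 539 properties and
1 664 532 288 relationships
were imported.
So it is a huge database :-) In our case, some nodes can have up to 30 000 outgoing relationships.
I created 3 indexes in neo4j: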
Indexes
ON :User(userId) ONLINE
ON :User(countryCode) ONLINE
ON :User(birthday) ONLINE
Then I tried to build an online recommendation engine using this query:
MATCH (me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()<-[:LIKE | :MATCH]-(similar:User)
USING INDEX me:User(userId)
USING INDEX similar:User(birthday)
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
Here is the execution plan for one of the users: plan
When I ran this query for a list of users, I got the following results:
count=2391, min=4565.128849, max=36257.170065, mean=13556.750555555178, stddev=2250.149335254768, median=13405.409811, p75=15361.353029999998, p95=17385.136478, p98=18040.900481, p99=18426.811424, p999=19506.149138, mean_rate=0.9957385490980866, m1=1.2148195797996817, m5=1.1418078036067119, m15=0.9928564378521962, rate_unit=events/second, duration_unit=milliseconds
So even the fastest run is too slow for real-time recommendations.
Can you tell me what I am doing wrong?
Thanks.
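Answer 0 (score: 2)
I built an unmanaged extension to see if I could do better than Cypher. You can grab it here => https://github.com/maxdemarzi/social_dna
This is a first shot; there are a few things we can do to speed it up. We can pre-calculate/save similar users, cache things here and there, and apply various other tricks. Give it a try and let us know how it goes.
Regards, Max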
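As a rough illustration of the "pre-calculate/save similar users" idea, here is a minimal Cypher sketch. It is not taken from the linked extension. The SIMILAR relationship type, its weight property, and the {max_similar} parameter are hypothetical names introduced only for this example, and whether this helps would have to be measured on the real data.

```cypher
// Offline step (e.g. a periodic batch job): store the top similar users
// for one user as SIMILAR relationships. SIMILAR, weight and {max_similar}
// are assumptions made for this sketch, not part of the original data model.
MATCH (me:User {userId: {source_user_id}})-[:LIKE|:MATCH]->()<-[:LIKE|:MATCH]-(similar:User)
WHERE similar <> me
WITH me, similar, count(*) AS weight
ORDER BY weight DESC
LIMIT {max_similar}
MERGE (me)-[s:SIMILAR]->(similar)
SET s.weight = weight;

// Online step: read the stored neighbours instead of recomputing them.
MATCH (me:User {userId: {source_user_id}})-[s:SIMILAR]->(similar)
RETURN similar, s.weight AS weight
ORDER BY weight DESC;
```

The trade-off is freshness: the stored relationships are only as current as the last batch run, so new LIKE or MATCH relationships are not reflected until the offline step is repeated.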
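Answer 1 (score: 0)
If I am reading this correctly, it finds all matches for the user by userId, and separately finds all users that match your various criteria. Then it finds all the places where the two sets come together.
Since you have a case where you start on the left with only a single node, my guess is that you would be better served by following the paths first and then filtering what the relationship traversal turned up.
Let's see if starting like this works for you: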
MATCH
(me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()
<-[:LIKE | :MATCH]-(similar:User)
WITH similar
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
Answer 2 (score: 0)
[UPDATED]
One possible (and nonintuitive) cause of inefficiency in your query is that when you specify the similar:User(birthday) filter, Cypher uses an index seek with the :User(birthday) index (and additional tests for countryCode and gender) to find all possible DB matches for similar. Let's call that large set of similar nodes A.
Only after finding A does the query filter to see which of those nodes are actually connected to me, as specified by your MATCH pattern.
Now, if there are relatively few me to similar paths (as specified by the MATCH pattern, but without considering its WHERE clause) compared to the size of A, say 2 or more orders of magnitude fewer, then it might be faster to remove the :User label from similar (since I presume they are probably all going to be users anyway, in your data model), and also remove the USING INDEX similar:User(birthday) clause. In this case, not using the index for similar may actually be faster for you, since you will only be applying the WHERE clause to a relatively small set of nodes.
The same considerations also apply to the recommendation node.
Of course, this all has to be verified by testing on your actual data.
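As a sketch of what that change could look like for the first part of your query (an untested assumption about your data; the parameter names are the ones from your question), the similar node loses its :User label and its index hint, so the planner has no birthday index to seek on and should expand from me instead:

```cypher
// First stage only: start from the single me node via the userId index,
// traverse to similar, then filter the (hopefully small) traversal result.
// Prefix with PROFILE to compare db hits against the original version.
PROFILE
MATCH (me:User {userId: {source_user_id}})-[:LIKE|:MATCH]->()<-[:LIKE|:MATCH]-(similar)
USING INDEX me:User(userId)
WHERE similar.birthday >= {target_age_gte} AND
      similar.birthday <= {target_age_lte} AND
      similar.countryCode = {target_country_code} AND
      similar.gender = {source_gender}
WITH similar, count(*) AS weight
ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
RETURN similar, weight
```

If the profile still shows an index seek on birthday, or the db hits do not drop, then this rewrite is not helping on your data and the original form may be just as good.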