How to use the average function with collections in Neo4j

Time: 2015-12-22 19:08:39

Tags: neo4j cypher

I want to compute the covariance of two vectors given as collections: A = [1,2,3,4], B = [5,6,7,8]

Cov(A,B)= Sigma [(ai-AVGa)*(bi-AVGb)] /(n-1)

The problems I run into when computing the covariance are:

1) I cannot nest aggregate functions, as happens when I write

SUM((ai-avg(a)) * (bi-avg(b)))

2) Or, put another way, how can I iterate over two collections in a single REDUCE, for example:

REDUCE(x= 0.0, ai IN COLLECT(a) | bi IN COLLECT(b) | x + (ai-avg(a))*(bi-avg(b)))

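3) If it is not possible to iterate over two collections in one REDUCE, how can I relate their values to compute the covariance when they are reduced separately?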

REDUCE(x= 0.0, ai IN COLLECT(a) | x + (ai-avg(a)))
REDUCE(y= 0.0, bi IN COLLECT(b) | y + (bi-avg(b)))

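I mean, can I write nested REDUCEs?

4) Is there any way to do this with UNWIND or EXTRACT?

Thanks in advance for any help.

4 Answers:

Answer 0: (score: 6)

cybersam's answer is perfectly fine, but if you want to avoid the n^2 Cartesian product that results from the double UNWIND, you can do this: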

WITH [1,2,3,4] AS a, [5,6,7,8] AS b
WITH REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
     REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b,
     SIZE(a) AS n, a, b
RETURN REDUCE(s = 0.0, i IN RANGE(0, n - 1) | s + ((a[i] - e_a) * (b[i] - e_b))) / (n - 1) AS cov;

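EDIT:

Not to call anyone out, but let me elaborate on why you'd want to avoid the double UNWIND in https://stackoverflow.com/a/34423783/2848578. As I said below, UNWINDing k length-n collections in Cypher results in n^k rows. So let's take two length-3 collections for which we want to compute the covariance.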

> WITH [1,2,3] AS a, [4,5,6] AS b
UNWIND a AS aa
UNWIND b AS bb
RETURN aa, bb;
   | aa | bb
---+----+----
 1 |  1 |  4
 2 |  1 |  5
 3 |  1 |  6
 4 |  2 |  4
 5 |  2 |  5
 6 |  2 |  6
 7 |  3 |  4
 8 |  3 |  5
 9 |  3 |  6
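
The n^k row count can also be checked directly by counting the rows instead of listing them. This is a minimal sketch using the same two length-3 lists; the alias row_count is only an illustrative name.

```cypher
// Two length-3 lists, unwound one after the other (k = 2, n = 3)
WITH [1,2,3] AS a, [4,5,6] AS b
UNWIND a AS aa
UNWIND b AS bb
// Each element of a is paired with each element of b
RETURN COUNT(*) AS row_count;
```

With these inputs the query should report 9 rows, that is 3^2.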

Now we have n^k = 3^2 = 9 rows. At this point, taking the average of these identifiers means we're taking the average of 9 values.

> WITH [1,2,3] AS a, [4,5,6] AS b
UNWIND a AS aa
UNWIND b AS bb
RETURN AVG(aa), AVG(bb);
   | AVG(aa) | AVG(bb)
---+---------+---------
 1 |     2.0 |     5.0

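Also as said below, this doesn't affect the answer, because the average of a repeated vector of numbers is always the same as the average of the original. For example, the average of {1,2,3} equals the average of {1,2,3,1,2,3}. For small values of n it likely doesn't matter, but as n grows you'll start to see performance degrade.

Say you have two vectors of about 1000 elements each (RANGE is inclusive, so RANGE(0, 1000) actually holds 1001 values). Computing the average of each with a double UNWIND: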
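
The claim that repeating a vector leaves its average unchanged can be sketched with the same REDUCE-based mean used earlier in this answer. The names v, vv, avg_v and avg_vv are illustrative only.

```cypher
// v is the original vector, vv is the same vector repeated twice
WITH [1,2,3] AS v, [1,2,3,1,2,3] AS vv
RETURN REDUCE(s = 0.0, x IN v | s + x) / SIZE(v) AS avg_v,     // 6.0 / 3
       REDUCE(s = 0.0, x IN vv | s + x) / SIZE(vv) AS avg_vv;  // 12.0 / 6
```

Both columns should come out as 2.0, which is why the double UNWIND still produces correct averages even though it does far more work.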

> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
UNWIND a AS aa
UNWIND b AS bb
RETURN AVG(aa), AVG(bb);
   | AVG(aa) | AVG(bb)
---+---------+---------
 1 |   500.0 |  1500.0

714 ms

That is significantly slower than using REDUCE:

> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
RETURN REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
       REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b;
   | e_a   | e_b   
---+-------+--------
 1 | 500.0 | 1500.0

4 ms

To bring it all together, I'll compare the two queries in full on the same roughly 1000-element vectors:

> WITH RANGE(0, 1000) AS aa, RANGE(1000, 2000) AS bb
UNWIND aa AS a
UNWIND bb AS b
WITH aa, bb, SIZE(aa) AS n, AVG(a) AS avgA, AVG(b) AS avgB
RETURN REDUCE(s = 0, i IN RANGE(0,n-1)| s +((aa[i]-avgA)*(bb[i]-avgB)))/(n-1) AS covariance;
   | covariance
---+------------
 1 |    83583.5

9105 ms

> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
WITH REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
     REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b,
     SIZE(a) AS n, a, b
RETURN REDUCE(s = 0.0, i IN RANGE(0, n - 1) | s + ((a[i] - e_a) * (b[i] - e_b))) / (n - 1) AS cov;
   | cov    
---+---------
 1 | 83583.5

33 ms

Answer 1: (score: 5)

[EDITED]

This should compute the covariance (according to your formula) from your sample input:

WITH [1,2,3,4] AS aa, [5,6,7,8] AS bb
UNWIND aa AS a
UNWIND bb AS b
WITH aa, bb, SIZE(aa) AS n, AVG(a) AS avgA, AVG(b) AS avgB
RETURN REDUCE(s = 0, i IN RANGE(0,n-1)| s +((aa[i]-avgA)*(bb[i]-avgB)))/(n-1) AS covariance;

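This approach is fine when n is small, as is the case with the original sample data.

However, as @NicoleWhite and @jjaderberg point out, this approach is inefficient when n is not small. @NicoleWhite's answer is an elegant general solution.

Answer 2: (score: 3)

How do you arrive at collections A and B? The avg function is an aggregating function; it cannot be used inside a REDUCE, nor can it be applied to a collection. You should compute your averages before you reach that point, but how best to do that depends on how you arrive at the two collections of values. If you are at a point where you have individual result items that you then collect to get A and B, that is the point at which you can use avg. For example: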

WITH [1, 2, 3, 4] AS aa UNWIND aa AS a
WITH collect(a) AS aa, avg(a) AS aAvg
RETURN aa, aAvg

and for both collections

WITH [1, 2, 3, 4] AS aColl UNWIND aColl AS a
WITH collect(a) AS aColl, avg(a) AS aAvg
WITH aColl, aAvg,[5, 6, 7, 8] AS bColl UNWIND bColl AS b
WITH aColl, aAvg, collect(b) AS bColl, avg(b) AS bAvg
RETURN aColl, aAvg, bColl, bAvg

Once you have the two averages, let's call them aAvg and bAvg, and the two collections, aColl and bColl, you can do

RETURN REDUCE(x = 0.0, i IN range(0, size(aColl) - 1) | x + ((aColl[i] - aAvg) * (bColl[i] - bAvg))) / (size(aColl) - 1) AS covariance
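
Putting the fragments of this answer together, the whole computation might look like the sketch below. It assumes the literal sample lists from the question stand in for however the values are actually produced in your graph.

```cypher
// Unwind the first list so avg() can aggregate over individual values,
// and re-collect it so the list is still available afterwards
WITH [1, 2, 3, 4] AS aColl UNWIND aColl AS a
WITH collect(a) AS aColl, avg(a) AS aAvg
// Repeat for the second list, carrying aColl and aAvg along as grouping keys
WITH aColl, aAvg, [5, 6, 7, 8] AS bColl UNWIND bColl AS b
WITH aColl, aAvg, collect(b) AS bColl, avg(b) AS bAvg
// Pair the elements by index and divide by n - 1 (sample covariance)
RETURN REDUCE(x = 0.0, i IN range(0, size(aColl) - 1) | x + ((aColl[i] - aAvg) * (bColl[i] - bAvg))) / (size(aColl) - 1) AS covariance
```

Because each list is unwound and collected separately, the intermediate row count stays at n rather than n^2.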

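Answer 3: (score: 0)

Thank you all very much. However, I would like to know which approach is the most efficient:

1) nested UNWIND with a REDUCE over a range -> @cybersam

2) nested REDUCE -> @Nicole White

3) nested WITH (restarting the query) -> @jjaderberg

But the important question is:

Why is there a discrepancy between your computed result and the real-world value?

I mean, your covariance equals 1.6666666666666667

but in the real world the covariance equals 1.25

Please check: https://www.easycalculation.com/statistics/covariance.php

Vector X: [1,2,3,4] Vector Y: [5,6,7,8]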


I think the difference arises because some calculators divide by n rather than by (n-1). Increasing the divisor from n-1 to n reduces the result from about 1.67 to 1.25.
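
The two divisors can be compared side by side. The sketch below reuses the REDUCE-based query from the top answer; the aliases sum_of_products, sample_cov and population_cov are names chosen here for illustration.

```cypher
WITH [1,2,3,4] AS a, [5,6,7,8] AS b
WITH REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
     REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b,
     SIZE(a) AS n, a, b
// Sum of (a[i] - mean(a)) * (b[i] - mean(b)), shared by both definitions
WITH REDUCE(s = 0.0, i IN RANGE(0, n - 1) | s + ((a[i] - e_a) * (b[i] - e_b))) AS sum_of_products, n
RETURN sum_of_products / (n - 1) AS sample_cov,  // divisor n - 1
       sum_of_products / n AS population_cov;    // divisor n
```

For the sample vectors the sum of products is 5.0, so dividing by n - 1 = 3 gives 1.6666666666666667 and dividing by n = 4 gives 1.25, matching the two values quoted above.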
