I want to compute the covariance of two vectors given as collections, e.g. A = [1,2,3,4] and B = [5,6,7,8].
Cov(A,B) = Sigma[(ai - AVGa) * (bi - AVGb)] / (n - 1)
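As a sanity check, the formula can be worked through by hand for the sample vectors. This is only a sketch; the symbols \bar{a} and \bar{b} are shorthand introduced here for AVGa and AVGb.

```latex
\bar{a} = \frac{1+2+3+4}{4} = 2.5, \qquad \bar{b} = \frac{5+6+7+8}{4} = 6.5

\mathrm{Cov}(A,B)
  = \frac{(-1.5)(-1.5) + (-0.5)(-0.5) + (0.5)(0.5) + (1.5)(1.5)}{4 - 1}
  = \frac{5}{3} \approx 1.667
```

Any query that implements the formula should therefore return roughly 1.667 for this input.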
My problems with computing the covariance are:
1) I can't nest aggregate functions, so I can't write
SUM((ai-avg(a)) * (bi-avg(b)))
2) Or, put another way, how can I extract two collections with a single REDUCE, e.g.:
REDUCE(x= 0.0, ai IN COLLECT(a) | bi IN COLLECT(b) | x + (ai-avg(a))*(bi-avg(b)))
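3) If it is not possible to extract two collections in one REDUCE, how can I relate their values to compute the covariance when they are in separate REDUCEs?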
REDUCE(x= 0.0, ai IN COLLECT(a) | x + (ai-avg(a))) REDUCE(y= 0.0, bi IN COLLECT(b) | y + (bi-avg(b)))
I mean, can I write a nested REDUCE?
4) Is there any way to do this with "unwind" or "extract"?
Thanks in advance for any help.
cybersam's answer is totally fine, but if you want to avoid the n^2 Cartesian product that results from the double UNWIND, you can do this instead:
WITH [1,2,3,4] AS a, [5,6,7,8] AS b
WITH REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
     REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b,
     SIZE(a) AS n, a, b
RETURN REDUCE(s = 0.0, i IN RANGE(0, n - 1) | s + ((a[i] - e_a) * (b[i] - e_b))) / (n - 1) AS cov;
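Edit:
Not calling anyone out, but let me elaborate on why you would want to avoid the double UNWIND in /sf/ask/17360801/. As I said below, UNWINDing k collections of length n in Cypher produces n^k rows. So let's take two collections of length 3 whose covariance we want to compute.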
> WITH [1,2,3] AS a, [4,5,6] AS b UNWIND a AS aa UNWIND b AS bb RETURN aa, bb;
   | aa | bb
---+----+----
 1 | 1  | 4
 2 | 1  | 5
 3 | 1  | 6
 4 | 2  | 4
 5 | 2  | 5
 6 | 2  | 6
 7 | 3  | 4
 8 | 3  | 5
 9 | 3  | 6
Now we have n^k = 3^2 = 9 rows. At this point, taking the average of these identifiers means taking the average of 9 values each.
> WITH [1,2,3] AS a, [4,5,6] AS b UNWIND a AS aa UNWIND b AS bb RETURN AVG(aa), AVG(bb);
   | AVG(aa) | AVG(bb)
---+---------+---------
 1 | 2.0     | 5.0
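Also as said below, this doesn't affect the answer, because the average of a repeated vector of numbers is always the same. For example, the average of {1,2,3} equals the average of {1,2,3,1,2,3}. This is likely inconsequential for small values of n, but as n gets larger you will start to see a performance decrease.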
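The reason the duplicated rows leave the averages unchanged can be written out in one line. This is a sketch that assumes two lists of equal length n, so that the double UNWIND repeats each a_i exactly n times (once per element of b):

```latex
\frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_i
  = \frac{1}{n^2} \cdot n \sum_{i=1}^{n} a_i
  = \frac{1}{n} \sum_{i=1}^{n} a_i
```

The result is the same, but the query sums n^2 terms to get it rather than n.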
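Let's say you have two vectors of length about 1000 (RANGE is inclusive at both ends, so each list below actually has 1001 elements). Computing the average of each with a double UNWIND: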
> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b UNWIND a AS aa UNWIND b AS bb RETURN AVG(aa), AVG(bb);
   | AVG(aa) | AVG(bb)
---+---------+---------
 1 | 500.0   | 1500.0
714 ms
This is significantly slower than using REDUCE:
> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
  RETURN REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
         REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b;
   | e_a   | e_b
---+-------+--------
 1 | 500.0 | 1500.0
4 ms
To bring it all together, I'll compare the two complete queries on the same vectors, RANGE(0, 1000) and RANGE(1000, 2000):
> WITH RANGE(0, 1000) AS aa, RANGE(1000, 2000) AS bb
  UNWIND aa AS a
  UNWIND bb AS b
  WITH aa, bb, SIZE(aa) AS n, AVG(a) AS avgA, AVG(b) AS avgB
  RETURN REDUCE(s = 0, i IN RANGE(0, n - 1) | s + ((aa[i] - avgA) * (bb[i] - avgB))) / (n - 1) AS covariance;
   | covariance
---+------------
 1 | 83583.5
9105 ms
> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
  WITH REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
       REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b,
       SIZE(a) AS n, a, b
  RETURN REDUCE(s = 0.0, i IN RANGE(0, n - 1) | s + ((a[i] - e_a) * (b[i] - e_b))) / (n - 1) AS cov;
   | cov
---+---------
 1 | 83583.5
33 ms
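If aggregate functions such as AVG and SUM are preferred over REDUCE, one possible variation is to UNWIND a single index range instead of both lists. That pairs a[i] with b[i] and yields n rows per pass rather than n^2. This is a sketch that does not come from either answer and has not been benchmarked here; it assumes both lists have the same length.

```cypher
// Sketch: pair elements by index so each pass produces n rows, not n^2.
WITH [1,2,3,4] AS a, [5,6,7,8] AS b
// First pass: one row per index, aggregated into the two means.
UNWIND range(0, size(a) - 1) AS i
WITH a, b, avg(a[i]) AS e_a, avg(b[i]) AS e_b
// Second pass: one row per index again, summing the products of deviations.
UNWIND range(0, size(a) - 1) AS i
WITH a, sum((a[i] - e_a) * (b[i] - e_b)) AS s
// Divide by n - 1 for the sample covariance.
RETURN s / (size(a) - 1) AS cov;
```

The two passes are kept separate because an aggregate cannot be nested inside another aggregate. The means are computed first and carried forward with WITH before the sum is taken.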
[EDITED]
This should compute the covariance (according to your formula) for your sample inputs:
WITH [1,2,3,4] AS aa, [5,6,7,8] AS bb
UNWIND aa AS a
UNWIND bb AS b
WITH aa, bb, SIZE(aa) AS n, AVG(a) AS avgA, AVG(b) AS avgB
RETURN REDUCE(s = 0, i IN RANGE(0, n - 1) | s + ((aa[i] - avgA) * (bb[i] - avgB))) / (n - 1) AS covariance;
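This approach is fine when n is small, as is the case with the original sample data.
However, as @NicoleWhite and @jjaderberg point out, when n is not small this approach becomes inefficient, because the double UNWIND produces n^2 rows. The answer by @NicoleWhite is an elegant general solution.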