Is there a formula to calculate the standard deviation of a dataset obtained by adding or subtracting other datasets?
Example:
Dataset1 (5 elements to count):
values: 5,10,15,20,25
mean: 15
Sum of Squared mean: 275 (5^2+10^2+...)/5
Population variance: 50
Population Standard deviation: 7,071067812
Population Max STD 22,07106781
Population Min STD 7,928932188
Dataset2 (5 elements to count):
values: 2,4,11,7,16
mean: 8
Sum of Squared mean: 89,2 (2^2+4^2+...)/5
Population variance: 25,2
Population Standard deviation: 5,019960159
Population Max STD 13,01996016
Population Min STD 2,980039841
Dataset3 (5 elements to count):
Each element is the sum of the corresponding elements of Dataset1 and Dataset2
values: 7,14,26,27,41
mean: 23 (<-- Ok, sum of the previous means)
Sum of Squared mean: 666,2
Population variance: 137,2
Population Standard deviation: 11,71324037
Population Max STD 34,71324037
Population Min STD 11,28675963
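The mean of Dataset3 is easy to calculate: it is the mean of Dataset1 plus the mean of Dataset2.
But how can the remaining values be calculated?
For example, I know the sum of squares can be used to compute the variance. Is there a formula that gives the sum of squares of Dataset3 directly from the statistics of Dataset1 and Dataset2?
If not, is there a way to calculate the variance of Dataset3 without using the covariance? (Using the covariance would mean performing another summation pass.) I am looking for a direct formula rather than recomputing over every element.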
Answer 0 (score: 0)
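Standard deviation, mean, variance, and similar statistics can be computed by keeping running totals of a few quantities. The statistics of the element-wise sum of two sets additionally require a running sum of the products of the paired data points. Ref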
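As a worked check of that idea on the question's data, the derivation below expands the square of each summed element. The symbols x_i, y_i and n = 5 are notation chosen here for Dataset1, Dataset2 and the element count. Note that the cross term, the sum of x_i y_i, cannot be recovered from the two datasets' separate means and variances, which is why one extra running sum is needed.

```latex
\begin{align*}
\sum_i x_i^2 &= 1375, \qquad \sum_i y_i^2 = 446, \qquad \sum_i x_i y_i = 10 + 40 + 165 + 140 + 400 = 755 \\
\sum_i (x_i + y_i)^2 &= \sum_i x_i^2 + 2\sum_i x_i y_i + \sum_i y_i^2 = 1375 + 2 \cdot 755 + 446 = 3331 \\
\operatorname{Var}(x + y) &= \frac{1}{n}\sum_i (x_i + y_i)^2 - \left(\bar{x} + \bar{y}\right)^2 = \frac{3331}{5} - 23^2 = 666.2 - 529 = 137.2 \\
\sigma_{x+y} &= \sqrt{137.2} \approx 11.7132
\end{align*}
```

These values agree with the figures listed for Dataset3 in the question (666,2 and 137,2 and 11,71324037).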
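Computing the STD is sensitive to subtraction (cancellation between large, nearly equal terms). It is advisable to keep sumxx, sumxy, and sumyy in higher precision. The program in this answer uses long and long long; a floating-point implementation could use double and long double. The higher-precision issues are not explored in detail here, beyond using the wider types for the running sums.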
#include <stdio.h>
#include <math.h>

// Running totals for paired samples (x, y).
struct stat2 {
    long sumx;        // SUM(x)
    long sumy;        // SUM(y)
    long long sumxx;  // SUM(x*x)
    long long sumxy;  // SUM(x*y)
    long long sumyy;  // SUM(y*y)
    size_t count;     // number of pairs added
};

// Fold one (x, y) pair into the running totals.
void stat2_add(struct stat2 *stat, long x, long y) {
    stat->sumx += x;
    stat->sumxx += 1LL * x * x;
    stat->sumy += y;
    stat->sumyy += 1LL * y * y;
    // This is the only extra reoccurring work needed to meet OP's goal
    stat->sumxy += 1LL * x * y;
    stat->count++;
}

// Mean of x ('x'), of y ('y'), or of x+y (any other index).
double stat2_avg(const struct stat2 *stat, int index) {
    switch (index) {
    case 'x':
        return 1.0 * stat->sumx / stat->count;
    case 'y':
        return 1.0 * stat->sumy / stat->count;
    default:
        return (1.0 * stat->sumx + stat->sumy) / stat->count;
    }
}

// Standard deviation of x ('x'), of y ('y'), or of x+y (any other index).
double stat2_std(const struct stat2 *stat, int index) {
    double offset = 0.0;  // 0.0: population STD; 1.0: sample STD
    double var;
    switch (index) {
    case 'x':
        var = (stat->sumxx - 1.0 * stat->sumx * stat->sumx / stat->count)
            / (stat->count - offset);
        break;
    case 'y':
        var = (stat->sumyy - 1.0 * stat->sumy * stat->sumy / stat->count)
            / (stat->count - offset);
        break;
    default: {
        // SUM(x+y) = SUM(x) + SUM(y)
        double z = 1.0 * stat->sumx + stat->sumy;
        // SUM((x+y)*(x+y)) = SUM(x*x) + 2*SUM(x*y) + SUM(y*y)
        double zz = stat->sumxx + 2LL * stat->sumxy + stat->sumyy;
        var = (zz - 1.0 * z * z / stat->count) / (stat->count - offset);
        break;
    }
    }
    return sqrt(var);
}

// Print the mean and STD of x, y, and x+y.
void stat2_report(const struct stat2 *stat, const char *title) {
    printf("%s\n", title);
    printf(" x Avg:%9f STD:%f\n", stat2_avg(stat, 'x'), stat2_std(stat, 'x'));
    printf(" y Avg:%9f STD:%f\n", stat2_avg(stat, 'y'), stat2_std(stat, 'y'));
    printf(" x+y Avg:%9f STD:%f\n", stat2_avg(stat, 'z'), stat2_std(stat, 'z'));
}

int main(void) {
    size_t i;
    struct stat2 A = { 0 };  // zero-initializes every member, including count
    int dataA[] = { 5, 10, 15, 20, 25 };
    int dataB[] = { 2, 4, 11, 7, 16 };
    for (i = 0; i < 5; i++)
        stat2_add(&A, dataA[i], dataB[i]);
    stat2_report(&A, "A");
    return 0;
}
Output
A
x Avg:15.000000 STD:7.071068
y Avg: 8.000000 STD:5.019960
x+y Avg:23.000000 STD:11.713240