Standard deviation addition and subtraction

Time: 2015-12-17 15:39:19

Tags: algorithm standard-deviation

  

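Is there a formula to calculate the standard deviation of a dataset whose elements are the sum or difference of the elements of other datasets?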

Example:

Dataset1 (5 elements to count):
values: 5,10,15,20,25
mean: 15
Mean of squares: 275 ((5^2+10^2+...)/5)
Population variance: 50
Population Standard deviation: 7,071067812
Population Max STD  22,07106781
Population Min STD  7,928932188
Dataset2 (5 elements to count):
values: 2,4,11,7,16
mean: 8
Mean of squares: 89,2 ((2^2+4^2+...)/5)
Population variance: 25,2
Population Standard deviation: 5,019960159
Population Max STD  13,01996016
Population Min STD  2,980039841
Dataset3 (5 elements to count):
Each element is the sum of the corresponding elements of Dataset1 and Dataset2
values: 7,14,26,27,41
mean: 23 (<-- Ok, sum of the previous means)
Mean of squares: 666,2
Population variance: 137,2
Population Standard deviation: 11,71324037
Population Max STD  34,71324037
Population Min STD  11,28675963

The mean of Dataset3 is easy to calculate as the mean of Dataset1 + the mean of Dataset2.

But how can the other values be calculated?

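For example, the sum of squares can be used to calculate the variance. Is there a way to calculate the sum of squares of Dataset3 directly, using a formula based on the values already known for Dataset1 and Dataset2?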

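If not, is there a way to calculate the variance of Dataset3 without using the covariance? (The covariance would require me to perform another summation pass.) I am looking for a more direct formula, rather than recomputing over every element.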

1 answer:

Answer 0 (score: 0)

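Standard deviation, mean, variance, etc. can be calculated by simply keeping running totals of a few quantities. The statistics of the sum of two sets can be obtained by additionally keeping a running sum of the product of the 2 data points in each pair.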
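The identity behind that running product sum can be written out as follows, worked with the question's data. The symbols are notation introduced here, and the value \sum x_i y_i = 755 is not stated in the question; it is computed from the listed values (10 + 40 + 165 + 140 + 400). Note that this product sum is the covariance term in another form: the variance of a sum depends on how the pairs vary together, so in general it cannot be recovered from the two separate variances alone.

```latex
\begin{aligned}
\sum_{i=1}^{n} (x_i \pm y_i)^2 &= \sum x_i^2 \pm 2\sum x_i y_i + \sum y_i^2 \\
\sigma_{x \pm y}^2 &= \frac{1}{n}\sum (x_i \pm y_i)^2 - \left(\bar{x} \pm \bar{y}\right)^2 \\
% Question's data: \sum x^2 = 1375, \sum y^2 = 446, \sum xy = 755, n = 5
\sum (x_i + y_i)^2 &= 1375 + 2 \cdot 755 + 446 = 3331 \\
\sigma_{x+y}^2 &= \frac{3331}{5} - 23^2 = 666.2 - 529 = 137.2
\end{aligned}
```

This agrees with the population variance of 137,2 listed for Dataset3 in the question.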

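The STD calculation is sensitive to the subtraction it contains. It is recommended to keep sumxx, sumxy and sumyy at higher precision. The code in this answer uses long/long long. A floating-point implementation could use double/long double. The details of the higher-precision concerns are not explored in depth here, beyond using the wider types for the running sums.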

#include <stdio.h>
#include <math.h>

struct stat2 {
  long sumx;
  long sumy;
  long long sumxx;
  long long sumxy;
  long long sumyy;
  size_t count;
};

void stat2_add(struct stat2 *stat, long x, long y) {
  stat->sumx += x;
  stat->sumxx += 1LL * x * x;
  stat->sumy += y;
  stat->sumyy += 1LL * y * y;

  // This is the only extra recurring work needed to meet OP's goal
  stat->sumxy += 1LL * x * y; 

  stat->count++;
}

double stat2_avg(const struct stat2 *stat, int index) {
  switch (index) {
    case 'x':
      return 1.0 * stat->sumx / stat->count;
    case 'y':
      return 1.0 * stat->sumy / stat->count;
    default:
      return 1.0 * (stat->sumx + stat->sumy) / stat->count;
  }
}

double stat2_std(const struct stat2 *stat, int index) {
  double offset = 0.0;  // or 1.0 depending on STD model
  double var;
  switch (index) {
    case 'x':
      var = (stat->sumxx - 1.0 * stat->sumx * stat->sumx / stat->count)
          / (stat->count - offset);
      break;
    case 'y':
      var = (stat->sumyy - 1.0 * stat->sumy * stat->sumy / stat->count)
          / (stat->count - offset);
      break;
    default: {
      // SUM(x+y) = SUM(x) + SUM(y)
      double z = stat->sumx + stat->sumy;
      // SUM((x+y)*(x+y)) = SUM(x*x) + 2*SUM(x*y) + SUM(y*y)
      double zz = stat->sumxx + 2LL * stat->sumxy + stat->sumyy;
      var = (zz - 1.0 * z * z / stat->count) / (stat->count - offset);
    }
  }
  return sqrt(var);
}

void stat2_report(const struct stat2 *stat, const char *title) {
  printf("%s\n", title);
  printf("  x   Avg:%9f  STD:%f\n", stat2_avg(stat, 'x'), stat2_std(stat, 'x'));
  printf("  y   Avg:%9f  STD:%f\n", stat2_avg(stat, 'y'), stat2_std(stat, 'y'));
  printf("  x+y Avg:%9f  STD:%f\n", stat2_avg(stat, 'z'), stat2_std(stat, 'z'));
}

int main(void) {
  size_t i;
  struct stat2 A = { 0 };  // zero-initialize every member, including count
  int dataA[] = { 5, 10, 15, 20, 25 };
  int dataB[] = { 2, 4, 11, 7, 16 };
  for (i = 0; i < 5; i++)
    stat2_add(&A, dataA[i], dataB[i]);
  stat2_report(&A, "A");
  return 0;
}

Output

A
  x   Avg:15.000000  STD:7.071068
  y   Avg: 8.000000  STD:5.019960
  x+y Avg:23.000000  STD:11.713240