Question

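I am writing a function that counts the number of outliers in an array of measurements.

A function that calculates the median is already given.

A measurement is an outlier if it falls outside the range [0.5 * median, 1.5 * median], and outliers should be discarded. I have done my best, but I would like to know how to get rid of the outliers in the original array. I made a new array to store the numbers that are within the range. The return value is the allocated data.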

task1_main.c

#include<stdio.h>
#include<stdlib.h>
#include "task1.c"

int main()
{
int i, size1, size2;

// reading the number of measurements in group1 
scanf("%d", &size1);        
float *measurements1 = malloc(size1*sizeof(float));
// reading the measurements in group1   
for(i=0; i<size1; i++)
scanf("%f", measurements1+i);

// reading the number of measurements in group2 
scanf("%d", &size2);        
float *measurements2 = malloc(size2*sizeof(float));
// reading the measurements in group2
for(i=0; i<size2; i++)
scanf("%f", measurements2+i);



float median1 = sort_and_find_median(measurements1, size1);
int new_size1;
float *measurements1_wo_outliers = discard_outliers(measurements1, size1, median1, &new_size1);

float median2 = sort_and_find_median(measurements2, size2);
int new_size2;
float *measurements2_wo_outliers = discard_outliers(measurements2, size2, median2, &new_size2);

// writing measurements for group1 after discarding the outliers
printf("%d\n", new_size1);
for(i=0; i<new_size1; i++)
printf("%.2f\n", measurements1_wo_outliers[i]);

printf("\n");
// writing measurements for group2 after discarding the outliers
printf("%d\n", new_size2);
for(i=0; i<new_size2; i++)
printf("%.2f\n", measurements2_wo_outliers[i]);


free(measurements1);
free(measurements2);
free(measurements1_wo_outliers);
free(measurements2_wo_outliers);
return 0;
}

task1.c

// function to sort the array in ascending order
float sort_and_find_median(float *measurements , int size)
{
  int i=0 , j=0;
  float temp=0;

  for(i=0 ; i<size ; i++)
    {
      for(j=0 ; j<size-1 ; j++)
    {
      if(measurements[j]>measurements[j+1])
        {
          temp        = measurements[j];
          measurements[j]    = measurements[j+1];
          measurements[j+1]  = temp;
        }
    }
    }

  return measurements[size/2];
}

float *discard_outliers(float *measurements, int size, float median, int *new_size)
{

  //float number_of_outliers[0];
  int i= 0;
  for(i = 0; i<size; i++){
    if((measurements[i] < (0.5*median)) && (measurements[i] > (1.5*median))){
      number_of_outliers[i] = measurements[i];
    }

  }


  *new_size = size - number_of_outliers;
  //to creates a new array of length *newsize using malloc 
  *measurements_wo_outliers = malloc( (*new_size) * sizeof(float) );

}

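Let us assume that group 1 and group 2 have 3 and 4 patients respectively, and that the measurements for group 1 and group 2 are {45.0, 23.15, 11.98} and {2.45, 11.0, 12.98, 77.80}.
The contents of measurements.txt would be: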

3

45.0

23.15

11.98

4

2.45

11.0

12.98

77.80

measurements.txt is:

25 23.0 21.5 27.6 2.5 19.23 21.0 23.5 24.6 19.5 19.23 26.01 22.5 24.6 20.15 18.23 19.73 22.25 26.6 45.5 5.23 18.0 24.5 23.26 22.5 18.93

20 11.12 10.32 9.91 14.32 12.32 20.37 13.32 11.57 2.32 13.32 11.22 12.32 10.91 8.32 14.56 10.16 35.32 12.91 12.58 13.32

and Expected_measurements is as follows:

22 18.00 18.23 18.93 19.23 19.23 19.50 19.73 20.15 21.00 21.50 22.25 22.50 22.50 23.00 23.26 23.50 24.50 24.60 24.60 26.01 26.60 27.60

17 8.32 9.91 10.16 10.32 10.91 11.12 11.22 11.57 12.32 12.32 12.58 12.91 13.32 13.32 13.32 14.32 14.56

Answer 1

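This is a basic approach that compacts the array by removing the outliers and then resizes it.

First, I notice that your logic for testing outliers is wrong. A measurement cannot be both less than 0.5*median and greater than 1.5*median ... unless the median is negative. Let's clean that up by allowing for both cases: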

// Choose stable lower and upper bounds
const float low =  (median < 0.f ? 1.5f : 0.5f) * median;
const float high = (median < 0.f ? 0.5f : 1.5f) * median;

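This ensures that low <= high always holds (unless low or high ends up as NaN).

Now you need to remove the outliers. The simplest way to do this is to keep a second index that records how many non-outliers you have seen so far. Walk through the array, and once any outlier has been found, shuffle the remaining values down as you go.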
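As a quick check of the bounds logic above, here is a minimal sketch that wraps the two expressions in a helper. The name outlier_bounds is hypothetical and not part of the question's code.

```c
/* Hypothetical helper: compute the accepted range around the median.
 * For a negative median the factors are swapped so that *low <= *high. */
static void outlier_bounds(float median, float *low, float *high)
{
    *low  = (median < 0.f ? 1.5f : 0.5f) * median;
    *high = (median < 0.f ? 0.5f : 1.5f) * median;
}
```

With a median of 10 this gives the range [5, 15]; with a median of -10 it gives [-15, -5].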

// Remove outliers
int num_clean = 0;
for(int i = 0; i < size; i++)
{
    float value = measurements[i];
    if(value >= low && value <= high)
    {
        // copy down only if an earlier outlier left a gap
        if (i != num_clean)
            measurements[num_clean] = value;
        ++num_clean;
    }
}

At the end, num_clean holds the number of values remaining. Whether you resize the array is up to you. You could use the following logic:

// Resize array
if (num_clean < size)
{
    float *new_measurements = realloc(measurements, num_clean * sizeof(float));
    if (new_measurements)
        measurements = new_measurements;
    *new_size = num_clean;
}

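Note that you may need some extra handling for the case where num_clean ends up as 0. You must decide whether to free the array. In the above, a realloc failure is also handled silently: we keep the original array pointer but still update new_size.

If you are not too worried about the extra memory, it is probably better to avoid reallocating altogether. Just return the number of clean samples and leave the remaining memory at the end of the array unused.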
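Because the main() in the question frees both the original array and the returned one, one way to sidestep these realloc edge cases is to leave the input untouched and return a freshly allocated array. The following is only a sketch that keeps the question's discard_outliers signature. Returning NULL when nothing remains is an assumption of this sketch, not something the question specifies.

```c
#include <stdlib.h>

/* Sketch: copy the in-range values into a newly allocated array.
 * Returns NULL (with *new_size set to 0) if nothing remains or malloc fails. */
float *discard_outliers(float *measurements, int size, float median, int *new_size)
{
    const float low  = (median < 0.f ? 1.5f : 0.5f) * median;
    const float high = (median < 0.f ? 0.5f : 1.5f) * median;
    int count = 0;

    *new_size = 0;

    /* first pass: count the values that are kept */
    for (int i = 0; i < size; i++)
        if (measurements[i] >= low && measurements[i] <= high)
            count++;

    if (count == 0)
        return NULL;            /* free(NULL) is harmless in the caller */

    float *kept = malloc((size_t)count * sizeof *kept);
    if (!kept)
        return NULL;

    /* second pass: copy the kept values, preserving their order */
    for (int i = 0, k = 0; i < size; i++)
        if (measurements[i] >= low && measurements[i] <= high)
            kept[k++] = measurements[i];

    *new_size = count;
    return kept;
}
```

The two passes trade a second scan of the data for an allocation of exactly the right size, so no realloc is needed.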
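A minimal sketch of that no-reallocation variant follows. The function name compact_in_range and its parameters are hypothetical; it assumes low and high were computed beforehand.

```c
#include <stddef.h>

/* Sketch: move in-range values to the front of the array, in order,
 * and return how many were kept. No memory is allocated or released. */
size_t compact_in_range(float *values, size_t size, float low, float high)
{
    size_t num_clean = 0;

    for (size_t i = 0; i < size; i++) {
        if (values[i] >= low && values[i] <= high)
            values[num_clean++] = values[i];
    }
    return num_clean;
}
```

The caller keeps using the same pointer and simply treats the returned count as the new length; the slots beyond it hold stale values and should be ignored.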

Answer 2

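You have quite a few problems beyond those covered in the current answer, but the problem with your outlier identification is that you use '&&' instead of '||', which prevents any outlier from ever being found because the test condition always evaluates to FALSE, e.g.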

if((measurements[i] < (0.5*median)) && (measurements[i] > (1.5*median))){

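(an array element cannot be both less than (0.5*median) and greater than (1.5*median) at the same time)

Beyond the outlier identification pointed out in the comments and in @paddy's answer, there is no need to copy or allocate in your outlier removal function. Instead, remove each outlier by using memmove to shuffle all elements above it down by one, and then, before returning from the function, you can (optionally) realloc once at the end to resize the allocation if any outliers were removed.

(which really is not necessary unless you are working on a memory-limited embedded system or dealing with millions of elements)

Tidying up your removal function and passing the address of the array from main(), so that the reallocation can happen inside the function without having to assign the return, you could do something like the following: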
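The corrected test can be sketched as a small predicate. The name is_outlier is hypothetical, and the sketch assumes a non-negative median, as in the question's data.

```c
#include <stdbool.h>

/* Hypothetical predicate: true when value lies outside
 * [0.5 * median, 1.5 * median]. Assumes median >= 0. */
static bool is_outlier(float value, float median)
{
    return value < 0.5f * median || value > 1.5f * median;
}
```

Values exactly on either bound count as in range, matching the closed interval stated in the question.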

/* remove outliers from array 'a' given 'median'.
 * takes address of array 'a', address of number of elements 'n',
 * and median 'median' to remove outliers. a is reallocated following
 * removal and n is updated to reflect the number of elements that
 * remain. returns pointer to reallocated array on success, NULL otherwise.
 */
double *rmoutliers (double **a, size_t *n, double median)
{
    size_t i = 0, nelem = *n;   /* index, save initial number of elements */

    while (i < *n)  /* loop over all elements identifying outliers */
        if ((*a)[i] < 0.5 * median || (*a)[i] > 1.5 * median) {
            if (i < *n - 1)     /* if not end, use memmove to remove */
                memmove (&(*a)[i], &(*a)[i+1], 
                        (*n - i - 1) * sizeof **a);
            (*n)--; /* decrement number of elements */
        }
        else        /* otherwise, increment index */
            i++;

    if (*n < nelem) {   /* if outliers removed */
        void *dbltmp = realloc (*a, *n * sizeof **a);   /* realloc */
        if (!dbltmp) {  /* validate reallocation */
            perror ("realloc-a");
            return NULL;
        }
        *a = dbltmp;    /* assign reallocated block to array */
    }

    return *a;      /* return array */
}

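Next, do not "roll your own" sort function. The C library provides qsort, which is orders of magnitude less likely to contain errors than your own (not to mention orders of magnitude faster). All you need to do is write a qsort compare function that receives pointers to two elements of the array and returns -1 if the first sorts before the second, 0 if the elements are equal, and 1 if the second sorts before the first. For numeric comparisons, you can return the difference of two inequalities to avoid potential overflow/underflow, e.g.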

    /* qsort compare to sort numbers in ascending order without overflow */
    return (a > b) - (a < b);

Note that in your case a and b will be pointers to double (or float), so to compare doubles, the proper cast before dereferencing is:

/* qsort compare function for doubles (ascending) */
int cmpdbl (const void *a, const void *b)
{
    return (*((double *)a) > *((double *)b)) - 
            (*((double *)a) < *((double *)b));
}

That is the only challenge in using qsort. After that, sorting the array in ascending order requires nothing more than:

        qsort (array, n, sizeof *array, cmpdbl);    /* use qsort to sort */

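(done...)

Putting it all together in a short example that simply reads each of your arrays as one line of input (1024 characters max), converts each value to double with sscanf, and stores any number of values in a dynamically sized array before sorting, getting the median, and calling your removal function, you could write the following.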
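Since the question stores its measurements as float rather than double, the same idea can be adapted as follows. This is only a sketch: cmpfloat is a hypothetical name, and the median is taken as measurements[size/2] to match the question's own convention.

```c
#include <stdlib.h>

/* qsort compare function for floats (ascending) */
static int cmpfloat(const void *a, const void *b)
{
    const float x = *(const float *)a, y = *(const float *)b;
    return (x > y) - (x < y);
}

/* Sketch: sort in place with qsort and return the middle element,
 * matching the question's measurements[size/2] convention.
 * Assumes size > 0. */
float sort_and_find_median(float *measurements, int size)
{
    qsort(measurements, (size_t)size, sizeof *measurements, cmpfloat);
    return measurements[size / 2];
}
```

For the question's first sample group {45.0, 23.15, 11.98}, the array is sorted to {11.98, 23.15, 45.0} and the function returns 23.15.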

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXC 1024   /* max characters to read per-line (per-array) */
#define MAXD 8      /* initial number of doubles to allocate */

/* qsort compare function for doubles (ascending) */
int cmpdbl (const void *a, const void *b)
{
    return (*((double *)a) > *((double *)b)) - 
            (*((double *)a) < *((double *)b));
}

/* remove outliers from array 'a' given 'median'.
 * takes address of array 'a', address of number of elements 'n',
 * and median 'median' to remove outliers. a is reallocated following
 * removal and n is updated to reflect the number of elements that
 * remain. returns pointer to reallocated array on success, NULL otherwise.
 */
double *rmoutliers (double **a, size_t *n, double median)
{
    size_t i = 0, nelem = *n;   /* index, save initial number of elements */

    while (i < *n)  /* loop over all elements identifying outliers */
        if ((*a)[i] < 0.5 * median || (*a)[i] > 1.5 * median) {
            if (i < *n - 1)     /* if not end, use memmove to remove */
                memmove (&(*a)[i], &(*a)[i+1], 
                        (*n - i - 1) * sizeof **a);
            (*n)--; /* decrement number of elements */
        }
        else        /* otherwise, increment index */
            i++;

    if (*n < nelem) {   /* if outliers removed */
        void *dbltmp = realloc (*a, *n * sizeof **a);   /* realloc */
        if (!dbltmp) {  /* validate reallocation */
            perror ("realloc-a");
            return NULL;
        }
        *a = dbltmp;    /* assign reallocated block to array */
    }

    return *a;      /* return array */
}

int main (void) {

    char buf[MAXC];
    int arrcnt = 1;

    while (fgets (buf, MAXC, stdin)) {  /* read line of data into buf */
        int offset = 0, nchr = 0;
        size_t  n = 0, ndbl = MAXD, size;
        double  *array = malloc (ndbl * sizeof *array), /* allocate */
                dbltmp, median;

        if (!array) {   /* validate initial allocation */
            perror ("malloc-array");
            return 1;
        }
        /* parse into doubles, store in dbltmp (should use strtod) */
        while (sscanf (buf + offset, "%lf%n", &dbltmp, &nchr) == 1) {
            if (n == ndbl) {    /* check if reallocation required */
                void *tmp = realloc (array, 2 * ndbl * sizeof *array);
                if (!tmp) {     /* validate */
                    perror ("realloc-array");
                    break;
                }
                array = tmp;    /* assign reallocated block */
                ndbl *= 2;      /* update allocated number of doubles */
            }
            array[n++] = dbltmp;    /* assign to array, increment index */
            offset += nchr;     /* update offset in buffer */
        }

        qsort (array, n, sizeof *array, cmpdbl);    /* use qsort to sort */
        median = array[n / 2];                      /* get median */

        /* output original array and number of values */
        printf ("\narray[%d] - %zu values\n\n", arrcnt++, n);
        for (size_t i = 0; i < n; i++) {
            if (i && i % 10 == 0)
                putchar ('\n');
            printf (" %5.2f", array[i]);
        }
        printf ("\n\nmedian: %5.2f\n\n", median);

        size = n;   /* save original number of doubles in array in size */
        if (!rmoutliers (&array, &n, median))   /* remove outliers */
            return 1;

        if (n < size) { /* check if outliers removed */
            printf ("%zu outliers removed - %zu values\n\n", size - n, n);
            for (size_t i = 0; i < n; i++) {
                if (i && i % 10 == 0)
                    putchar ('\n');
                printf (" %5.2f", array[i]);
            }
            printf ("\n\n");
        }
        else    /* otherwise warn no outliers removed */
            fputs ("warning: no outliers found.\n\n", stderr);

        free (array);   /* don't forget to free what you allocate */
    }
}

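(note: you really should use strtod, because sscanf provides no error handling beyond reporting success/failure of the conversion, but that is left for another day or as an exercise for you)

Example Input File

Note: I did not use the size: X information in the data file. It is not needed; I simply used a dynamic allocation scheme that resizes the array as required. The input file format I used contains the measurements for each array on a separate line, e.g.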
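For reference, a strtod-based parse of one input line might look like the sketch below. The function parse_doubles and its fixed-capacity interface are hypothetical and are not part of the program above, which grows its array dynamically.

```c
#include <errno.h>
#include <stdlib.h>

/* Sketch: parse up to max whitespace-separated doubles from line into out.
 * Returns the number of values stored; stops at the first token that
 * is not a number or is out of range. */
size_t parse_doubles(const char *line, double *out, size_t max)
{
    size_t n = 0;
    const char *p = line;

    while (n < max) {
        char *end;
        errno = 0;
        double v = strtod(p, &end);
        if (end == p)           /* nothing converted: end of data or bad token */
            break;
        if (errno == ERANGE)    /* overflow or underflow reported by strtod */
            break;
        out[n++] = v;
        p = end;                /* continue after the converted text */
    }
    return n;
}
```

Compared with sscanf, the end pointer shows exactly where conversion stopped, and errno distinguishes out-of-range values from ordinary ones.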

23.0 21.5 27.6 2.5 19.23 21.0 23.5 24.6 19.5 19.23 26.01 22.5 24.6 20.15 ... 18.93
11.12 10.32 9.91 14.32 12.32 20.37 13.32 11.57 2.32 13.32 11.22 12.32 ... 13.32

Example Use/Output

$ ./bin/rmoutliers <dat/outlierdata.txt

array[1] - 25 values

  2.50  5.23 18.00 18.23 18.93 19.23 19.23 19.50 19.73 20.15
 21.00 21.50 22.25 22.50 22.50 23.00 23.26 23.50 24.50 24.60
 24.60 26.01 26.60 27.60 45.50

median: 22.25

3 outliers removed - 22 values

 18.00 18.23 18.93 19.23 19.23 19.50 19.73 20.15 21.00 21.50
 22.25 22.50 22.50 23.00 23.26 23.50 24.50 24.60 24.60 26.01
 26.60 27.60


array[2] - 20 values

  2.32  8.32  9.91 10.16 10.32 10.91 11.12 11.22 11.57 12.32
 12.32 12.58 12.91 13.32 13.32 13.32 14.32 14.56 20.37 35.32

median: 12.32

3 outliers removed - 17 values

  8.32  9.91 10.16 10.32 10.91 11.12 11.22 11.57 12.32 12.32
 12.58 12.91 13.32 13.32 13.32 14.32 14.56

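(note: with any code that dynamically allocates memory, you should run the program through a memory error checker such as valgrind for Linux; other OSes have similar tools. Just add valgrind to the beginning of your command, e.g. valgrind ./bin/rmoutliers <dat/outlierdata.txt, and confirm that you have freed all the memory you allocated and that there are no memory errors.)

Look things over and let me know if you have any questions.

Memory Use/Error Check

In your comment you appear to be concerned that what I have done might leak memory; it does not. As noted above, you can use a tool like valgrind to verify memory use and check for any memory errors, e.g.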

$ valgrind ./bin/rmoutliers <dat/outlierdata.txt
==28383== Memcheck, a memory error detector
==28383== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==28383== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==28383== Command: ./bin/rmoutliers
==28383==

array[1] - 25 values

  2.50  5.23 18.00 18.23 18.93 19.23 19.23 19.50 19.73 20.15
 21.00 21.50 22.25 22.50 22.50 23.00 23.26 23.50 24.50 24.60
 24.60 26.01 26.60 27.60 45.50

median: 22.25

3 outliers removed - 22 values

 18.00 18.23 18.93 19.23 19.23 19.50 19.73 20.15 21.00 21.50
 22.25 22.50 22.50 23.00 23.26 23.50 24.50 24.60 24.60 26.01
 26.60 27.60


array[2] - 20 values

  2.32  8.32  9.91 10.16 10.32 10.91 11.12 11.22 11.57 12.32
 12.32 12.58 12.91 13.32 13.32 13.32 14.32 14.56 20.37 35.32

median: 12.32

3 outliers removed - 17 values

  8.32  9.91 10.16 10.32 10.91 11.12 11.22 11.57 12.32 12.32
 12.58 12.91 13.32 13.32 13.32 14.32 14.56

==28383==
==28383== HEAP SUMMARY:
==28383==     in use at exit: 0 bytes in 0 blocks
==28383==   total heap usage: 8 allocs, 8 frees, 1,208 bytes allocated
==28383==
==28383== All heap blocks were freed -- no leaks are possible
==28383==
==28383== For counts of detected and suppressed errors, rerun with: -v
==28383== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Notice above that there are "8 allocs and 8 frees" associated with the memory used, e.g.

==28383==   total heap usage: 8 allocs, 8 frees, 1,208 bytes allocated

You can also confirm that all memory has been freed and that there are no leaks from the next line:

==28383== All heap blocks were freed -- no leaks are possible

Finally, you can confirm that there were no memory errors associated with your use of memory during program execution:

==28383== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

If some part of the code still gives you trouble with freeing memory, let me know and I am happy to help further.
