Awk计算平均忽略异常值 - 用于分段文件

时间:2014-09-19 07:37:51

标签: bash awk average outliers

我的数据文件(data.txt)看起来像这样,

0.01667 20.53
0.01667 6.35
0.01667 6.94
0.01667 7.07
0.01667 8.06
0.01667 8.10
0.01667 8.25
0.01667 8.71
0.01667 9.31
0.02500 20.19
0.02500 6.35
0.02500 6.92
0.02500 7.07
0.02500 8.08
0.02500 8.09
0.02500 8.24
0.02500 8.70
0.02500 9.26
0.03333 19.89
0.03333 6.33
0.03333 6.90
0.03333 7.07
0.03333 8.07
0.03333 8.09
0.03333 8.22
0.03333 8.70
0.03333 9.22
0.04167 19.65
0.04167 6.34
0.04167 6.87
0.04167 7.07
0.04167 8.03
0.04167 8.08
0.04167 8.19
0.04167 8.69
0.04167 9.19
0.05000 19.40
0.05000 6.32
0.05000 6.85
0.05000 7.06
0.05000 8.02
0.05000 8.09
0.05000 8.16
0.05000 8.71
0.05000 9.15
0.05833 19.12
0.05833 6.29
0.05833 6.84
0.05833 7.04
0.05833 8.01
0.05833 8.11
0.05833 8.16
0.05833 8.71
0.05833 9.11
0.06667 18.84
0.06667 6.29
0.06667 6.82
0.06667 7.05
0.06667 7.98
0.06667 8.11
0.06667 8.14
0.06667 8.71
0.06667 9.06
0.07500 18.57
0.07500 6.29
0.07500 6.80
0.07500 7.06
0.07500 7.97
0.07500 8.10
0.07500 8.13
0.07500 8.71
0.07500 9.02

第1列是第2列中的测量值的时间。我需要对第1列中给出的每个时间平均第2列中的值,并输出时间值和该时间的平均值。我可以使用以下awk代码

来执行avearge
awk '{if($1<0)$1=0}
    {
        sum[$1]+=$2
        cnt[$1]++
    }
    END {
    #     print "Name" "\t" "sum" "\t" "cnt" "\t" "avg"
        for (i in sum)
            printf "%8.5f   %6.2f   %6d   %6.3f\n", i, sum[i], cnt[i], sum[i]/cnt[i]

    }' data.txt  | sort -n -k1 > avgFile.txt

请注意,我还输出了一些其他内容,以便我可以检查我做了正确的事情。正如您所看到的,每个时间段的数据都包含异常值,我需要删除它们。我试图选择在0.01667收集的说法数据到某个文件temp.txt,我有以下awk代码正确删除异常值

awk 'BEGIN{CNT=0} {ROW[CNT]=$0;DATA[CNT]=$2; 
    TOTAL+=$2;CNT+=1;} END{for (i = 0;i < NR; i++){if ((sqrt((DATA[i]-(TOTAL/NR))^2))<((TOTAL/NR)*30/100)) 
    {print ROW[i] ;}}}' temp.txt

但是我需要在原始代码中执行此操作,以便在计算第2列中值的平均值之前每次删除此异常值

任何帮助都将受到高度赞赏。

2 个答案:

答案 0 :(得分:0)

计算平均值,然后删除异常值,然后在删除异常值后重新计算avergaes:

$ cat tst.awk
{
    vals[$1][$2]
    sum[$1] += $2
    cnt[$1]++
}

END {
    div = 0.3
    for (time in vals) {
        ave  = sum[time] / cnt[time]
        low  = ave * (1 - div)
        high = ave * (1 + div)
        for (val in vals[time]) {
            if ( (val < low) || (val > high) ) {
                print "Deleting outlier", time, val | "cat>&2"
                sum[time] -= val
                cnt[time]--
            }
        }
    }

    for (time in vals) {
        ave = (cnt[time] > 0 ? sum[time] / cnt[time] : 0)
        print time, sum[time], cnt[time], ave
    }
}

$ awk -f tst.awk file
0.05000 56.04 7 8.00571
0.07500 62.08 8 7.76
0.04167 56.12 7 8.01714
0.03333 56.27 7 8.03857
0.01667 56.44 7 8.06286
0.06667 55.87 7 7.98143
0.02500 56.36 7 8.05143
0.05833 55.98 7 7.99714
Deleting outlier 0.05000 6.32
Deleting outlier 0.05000 19.40
Deleting outlier 0.07500 18.57
Deleting outlier 0.04167 19.65
Deleting outlier 0.04167 6.34
Deleting outlier 0.03333 6.33
Deleting outlier 0.03333 19.89
Deleting outlier 0.01667 6.35
Deleting outlier 0.01667 20.53
Deleting outlier 0.06667 6.29
Deleting outlier 0.06667 18.84
Deleting outlier 0.02500 20.19
Deleting outlier 0.02500 6.35
Deleting outlier 0.05833 6.29
Deleting outlier 0.05833 19.12

那是你在寻找什么?它使用GNU awk实现真正的二维数组。

答案 1 :(得分:0)

好的,我告诉你,当我有时间的时候我会写一个快速的脚本(事实证明它不是那么快)这会删除异常值并返回清理过的数组的平均值。如果需要,您可以实施标准偏差。如果您有任何疑问,请与我联系。:

#!/bin/bash

## generic error/usage function
function usage {
    local ecode=${2:-0}
    test -n "$1" && printf "\n %s\n" "$1" >&2
cat >&2 << helpMessage

usage:  ${0//*\//} datafile

This script will process a 2-column datafile to provide average, 
mean and std. deviation for each time group of data while removing
outlying data from the calculation. The datafile format:

    time    value
    0.01667 20.53  <- outlier
    0.01667 6.35
    0.01667 6.94
    ...

Options:

    -h  |  --help  program help (this file)

helpMessage

    exit $ecode;
}

## function to calculate average of arguments
function average {

    local sum=0
    declare -i count=0
    for n in $@; do
        sum=$( printf "scale=6; %s+%s\n" "$sum" "$n" | bc )
        ((count++))
    done
    avg=$( printf "scale=6; %s/%s\n" "$sum" "$count" | bc )
    printf "%s\n" "$avg"
}

## function to examine arguments a remove any outlier
#  that is greater than 4 from the average.
#  values without the outlier are returned to command line
function rmoutlier {

    local avg=$(average $@)
    local diff=0
    for i in $@; do
        diff=$( printf "scale=6; %s-%s\n" "$i" "$avg" | bc )
        [ "${diff:0:1}" = '-' ]  && diff="${diff:1}"            # quick absolute value hack
        [ "${diff:0:1}" = '.' ]  && diff=0                      # set any fractional 0
        if [ $((${diff//.*/})) -lt 4 ]; then
            clean+=( $i )                                       # if whole num diff < 4, keep
        else
            echo "->outlier: $i" >&2                            # print outlier to stderr
        fi
    done
    echo ${clean[@]}                                            # return array
}

## respond to -h or --help
test "${1:1}" = 'h' || test "${1:2}" = 'help' && usage

## set variables
dfn="${1:-dat/outlier.dat}"     # datafile (default dat/outlier.dat)
declare -a tmp                  # temporary array holding data for given time
ptime=0                         # variable holding previous time (flag for 1st line)

## validate input filename
test -r "$dfn" || usage "Error: invalid input. File '$dfn' not found" 1

while read -r time data || [ -n "$data" ]; do               # read all lines of data

    if [ "$ptime" = 0 ] || [ "$ptime" = "$time" ]; then     # if no change in time
        tmp+=( $data )                                      # fill array with data
    else
        echo "  time: $ptime  data : '${tmp[@]}'" >&2       # output array to stderr

        ## process data
        clean=( $(rmoutlier ${tmp[@]} ) )                   # remove outlier
        echo  "  time: $ptime  clean: '${clean[@]}'" >&2    # output clean array
        avgclean=$( average ${clean[@]} )                   # average clean array
        printf "  avgclean: %s\n\n" "$avgclean" >&2         # output avg of clean array

        unset tmp           # reset variables for next time
        unset clean
        unset avgclean

        tmp+=( $data )      # read first value for next time set
    fi

    ptime="$time"           # save previous time for comparison

done <"$dfn"

## process final time block

echo "  time: $ptime  data : '${tmp[@]}'" >&2

## process data
clean=( $(rmoutlier ${tmp[@]} ) )
echo  "  time: $ptime  clean: '${clean[@]}'" >&2
avgclean=$( average ${clean[@]} )
printf "  avgclean: %s\n\n" "$avgclean" >&2

unset tmp
unset clean
unset avgclean

exit 0

<强>用法:

./outlier.sh  datafile

<强>输出:

$ ./outlier.sh dat/outlier.dat
  time: 0.01667  data : '20.53 6.35 6.94 7.07 8.06 8.10 8.25 8.71 9.31'
->outlier: 20.53
  time: 0.01667  clean: '6.35 6.94 7.07 8.06 8.10 8.25 8.71 9.31'
  avgclean: 7.848750

  time: 0.02500  data : '20.19 6.35 6.92 7.07 8.08 8.09 8.24 8.70 9.26'
->outlier: 20.19
  time: 0.02500  clean: '6.35 6.92 7.07 8.08 8.09 8.24 8.70 9.26'
  avgclean: 7.838750

  time: 0.03333  data : '19.89 6.33 6.90 7.07 8.07 8.09 8.22 8.70 9.22'
->outlier: 19.89
  time: 0.03333  clean: '6.33 6.90 7.07 8.07 8.09 8.22 8.70 9.22'
  avgclean: 7.825000

  time: 0.04167  data : '19.65 6.34 6.87 7.07 8.03 8.08 8.19 8.69 9.19'
->outlier: 19.65
  time: 0.04167  clean: '6.34 6.87 7.07 8.03 8.08 8.19 8.69 9.19'
  avgclean: 7.807500

  time: 0.05000  data : '19.40 6.32 6.85 7.06 8.02 8.09 8.16 8.71 9.15'
->outlier: 19.40
  time: 0.05000  clean: '6.32 6.85 7.06 8.02 8.09 8.16 8.71 9.15'
  avgclean: 7.795000

  time: 0.05833  data : '19.12 6.29 6.84 7.04 8.01 8.11 8.16 8.71 9.11'
->outlier: 19.12
  time: 0.05833  clean: '6.29 6.84 7.04 8.01 8.11 8.16 8.71 9.11'
  avgclean: 7.783750

  time: 0.06667  data : '18.84 6.29 6.82 7.05 7.98 8.11 8.14 8.71 9.06'
->outlier: 18.84
  time: 0.06667  clean: '6.29 6.82 7.05 7.98 8.11 8.14 8.71 9.06'
  avgclean: 7.770000

  time: 0.07500  data : '18.57 6.29 6.80 7.06 7.97 8.10 8.13 8.71 9.02'
->outlier: 18.57
  time: 0.07500  clean: '6.29 6.80 7.06 7.97 8.10 8.13 8.71 9.02'
  avgclean: 7.760000

附录:写时间&amp; avg to file

下面是脚本的更新位,它将timeclean average输出到文件(默认值:dat / outlier.out)。只有包含“输出文件名”ofn的行已更改。 (你可以将你想要的任何输出文件名作为第二个参数传递给脚本)所以新的usage:将是:outlier.sh input_file output_file

## set variables
dfn="${1:-dat/outlier.dat}"     # datafile (default dat/outlier.dat)
ofn="${2:-dat/outlier.out}"     # output file (default dat/outlier.out)
declare -a tmp                  # temporary array holding data for given time
ptime=0                         # variable holding previous time (flag for 1st line)

:> "$ofn"                       # truncate output file

## validate input filename
test -r "$dfn" || usage "Error: invalid input. File '$dfn' not found" 1

while read -r time data || [ -n "$data" ]; do               # read all lines of data

    if [ "$ptime" = 0 ] || [ "$ptime" = "$time" ]; then     # if no change in time
        tmp+=( $data )                                      # fill array with data
    else
        echo "  time: $ptime  data : '${tmp[@]}'" >&2       # output array to stderr
        printf "  time: %s  " "$ptime" >>"$ofn"             # output array to file

        ## process data
        clean=( $(rmoutlier ${tmp[@]} ) )                   # remove outlier
        echo  "time: $ptime  clean: '${clean[@]}'" >&2      # output clean array
        avgclean=$( average ${clean[@]} )                   # average clean array
        printf "  avgclean: %s\n\n" "$avgclean" >&2         # output avg of clean array
        printf "  avgclean: %s\n" "$avgclean" >>"$ofn"      # output avg of clean array to file

        unset tmp           # reset variables for next time
        unset clean
        unset avgclean

        tmp+=( $data )      # read first value for next time set
    fi

    ptime="$time"           # save previous time for comparison

done <"$dfn"

<强> outlier.out:

time: 0.01667    avgclean: 7.848750
time: 0.02500    avgclean: 7.838750
time: 0.03333    avgclean: 7.825000
time: 0.04167    avgclean: 7.807500
time: 0.05000    avgclean: 7.795000
time: 0.05833    avgclean: 7.783750
time: 0.06667    avgclean: 7.770000