My data file (data.txt) looks like this:
0.01667 20.53
0.01667 6.35
0.01667 6.94
0.01667 7.07
0.01667 8.06
0.01667 8.10
0.01667 8.25
0.01667 8.71
0.01667 9.31
0.02500 20.19
0.02500 6.35
0.02500 6.92
0.02500 7.07
0.02500 8.08
0.02500 8.09
0.02500 8.24
0.02500 8.70
0.02500 9.26
0.03333 19.89
0.03333 6.33
0.03333 6.90
0.03333 7.07
0.03333 8.07
0.03333 8.09
0.03333 8.22
0.03333 8.70
0.03333 9.22
0.04167 19.65
0.04167 6.34
0.04167 6.87
0.04167 7.07
0.04167 8.03
0.04167 8.08
0.04167 8.19
0.04167 8.69
0.04167 9.19
0.05000 19.40
0.05000 6.32
0.05000 6.85
0.05000 7.06
0.05000 8.02
0.05000 8.09
0.05000 8.16
0.05000 8.71
0.05000 9.15
0.05833 19.12
0.05833 6.29
0.05833 6.84
0.05833 7.04
0.05833 8.01
0.05833 8.11
0.05833 8.16
0.05833 8.71
0.05833 9.11
0.06667 18.84
0.06667 6.29
0.06667 6.82
0.06667 7.05
0.06667 7.98
0.06667 8.11
0.06667 8.14
0.06667 8.71
0.06667 9.06
0.07500 18.57
0.07500 6.29
0.07500 6.80
0.07500 7.06
0.07500 7.97
0.07500 8.10
0.07500 8.13
0.07500 8.71
0.07500 9.02
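Column 1 is the time at which the value in column 2 was measured. For each time in column 1, I need to average the column 2 values and output the time together with its average. I can do the averaging with the following awk code:
awk '{if($1<0)$1=0}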
{
sum[$1]+=$2
cnt[$1]++
}
END {
# print "Name" "\t" "sum" "\t" "cnt" "\t" "avg"
for (i in sum)
printf "%8.5f %6.2f %6d %6.3f\n", i, sum[i], cnt[i], sum[i]/cnt[i]
}' data.txt | sort -n -k1 > avgFile.txt
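Note that I also print a few extra fields so that I can check the result. As you can see, the data for each time contains outliers, and I need to remove them. As a test, I extracted the data collected at, say, 0.01667 into a file temp.txt, and the following awk code removes the outliers correctly: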
awk 'BEGIN{CNT=0} {ROW[CNT]=$0;DATA[CNT]=$2;
TOTAL+=$2;CNT+=1;} END{for (i = 0;i < NR; i++){if ((sqrt((DATA[i]-(TOTAL/NR))^2))<((TOTAL/NR)*30/100))
{print ROW[i] ;}}}' temp.txt
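However, I need to do this inside my original code, so that the outliers for each time are removed before the column 2 values are averaged.
Any help would be highly appreciated.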
Answer 0 (score: 0)
Calculate the averages, then remove the outliers, then recalculate the averages with the outliers removed:
$ cat tst.awk
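{
    vals[$1][$2]++          # count each value per time so duplicates are not lost
    sum[$1] += $2
    cnt[$1]++
}
END {
    div = 0.3               # keep values within 30% of the group average
    for (time in vals) {
        ave  = sum[time] / cnt[time]
        low  = ave * (1 - div)
        high = ave * (1 + div)
        for (val in vals[time]) {
            # array indices are strings, so add 0 to force a numeric comparison
            if ( (val+0 < low) || (val+0 > high) ) {
                print "Deleting outlier", time, val | "cat>&2"
                sum[time] -= val * vals[time][val]
                cnt[time] -= vals[time][val]
            }
        }
    }
    for (time in vals) {
        ave = (cnt[time] > 0 ? sum[time] / cnt[time] : 0)
        print time, sum[time], cnt[time], ave
    }
}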
$ awk -f tst.awk file
0.05000 56.04 7 8.00571
0.07500 62.08 8 7.76
0.04167 56.12 7 8.01714
0.03333 56.27 7 8.03857
0.01667 56.44 7 8.06286
0.06667 55.87 7 7.98143
0.02500 56.36 7 8.05143
0.05833 55.98 7 7.99714
Deleting outlier 0.05000 6.32
Deleting outlier 0.05000 19.40
Deleting outlier 0.07500 18.57
Deleting outlier 0.04167 19.65
Deleting outlier 0.04167 6.34
Deleting outlier 0.03333 6.33
Deleting outlier 0.03333 19.89
Deleting outlier 0.01667 6.35
Deleting outlier 0.01667 20.53
Deleting outlier 0.06667 6.29
Deleting outlier 0.06667 18.84
Deleting outlier 0.02500 20.19
Deleting outlier 0.02500 6.35
Deleting outlier 0.05833 6.29
Deleting outlier 0.05833 19.12
Is that what you are looking for? It uses GNU awk for true 2-D arrays.
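The script above depends on GNU awk's arrays of arrays. For an awk without them, a minimal sketch of the same 30% rule is to read the file twice: the first pass collects the per-time sum and count, and the second pass keeps only the values close enough to that time's average. The function name avg_no_outliers and the div variable are assumptions made for this illustration, not part of the original answer.

```shell
# Sketch: average column 2 per time (column 1), ignoring values that are
# more than 30% away from that time's raw average. Works with POSIX awk.
avg_no_outliers() {
    awk -v div=0.3 '
    # First reading of the file: per-time sum and count of all values.
    NR == FNR { sum[$1] += $2; cnt[$1]++; next }
    # Second reading: keep a value only if it lies within the band.
    {
        ave = sum[$1] / cnt[$1]
        if ($2 >= ave * (1 - div) && $2 <= ave * (1 + div)) {
            csum[$1] += $2
            ccnt[$1]++
        }
    }
    END {
        # Times whose values were all rejected produce no output line.
        for (t in ccnt)
            printf "%s %.5f\n", t, csum[t] / ccnt[t]
    }' "$1" "$1" | sort -n
}
```

It would be called as avg_no_outliers data.txt > avgFile.txt. Reading the file twice avoids storing every value in memory, at the cost of requiring a regular file rather than a pipe.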
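Answer 1 (score: 0)
OK, I told you I would write a quick script when I had time (it turned out not to be that quick). It removes the outliers and returns the average of the cleaned array. You can implement a standard deviation check if needed. Let me know if you have any questions: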
#!/bin/bash
## generic error/usage function
function usage {
local ecode=${2:-0}
test -n "$1" && printf "\n %s\n" "$1" >&2
cat >&2 << helpMessage
usage: ${0//*\//} datafile
This script will process a 2-column datafile to provide the average
for each time group of data while removing outlying data from the
calculation. The datafile format:
time value
0.01667 20.53 <- outlier
0.01667 6.35
0.01667 6.94
...
Options:
-h | --help program help (this file)
helpMessage
exit $ecode;
}
## function to calculate average of arguments
function average {
local sum=0
declare -i count=0
for n in $@; do
sum=$( printf "scale=6; %s+%s\n" "$sum" "$n" | bc )
((count++))
done
avg=$( printf "scale=6; %s/%s\n" "$sum" "$count" | bc )
printf "%s\n" "$avg"
}
## function to examine arguments and remove any outlier
# whose distance from the average is 4 or more (whole-number part).
# values without the outliers are returned on stdout
function rmoutlier {
local avg=$(average $@)
local diff=0
for i in $@; do
diff=$( printf "scale=6; %s-%s\n" "$i" "$avg" | bc )
[ "${diff:0:1}" = '-' ] && diff="${diff:1}" # quick absolute value hack
[ "${diff:0:1}" = '.' ] && diff=0 # set any fractional 0
if [ $((${diff//.*/})) -lt 4 ]; then
clean+=( $i ) # if whole num diff < 4, keep
else
echo "->outlier: $i" >&2 # print outlier to stderr
fi
done
echo ${clean[@]} # return array
}
## respond to -h or --help
test "${1:1}" = 'h' || test "${1:2}" = 'help' && usage
## set variables
dfn="${1:-dat/outlier.dat}" # datafile (default dat/outlier.dat)
declare -a tmp # temporary array holding data for given time
ptime=0 # variable holding previous time (flag for 1st line)
## validate input filename
test -r "$dfn" || usage "Error: invalid input. File '$dfn' not found" 1
while read -r time data || [ -n "$data" ]; do # read all lines of data
if [ "$ptime" = 0 ] || [ "$ptime" = "$time" ]; then # if no change in time
tmp+=( $data ) # fill array with data
else
echo " time: $ptime data : '${tmp[@]}'" >&2 # output array to stderr
## process data
clean=( $(rmoutlier ${tmp[@]} ) ) # remove outlier
echo " time: $ptime clean: '${clean[@]}'" >&2 # output clean array
avgclean=$( average ${clean[@]} ) # average clean array
printf " avgclean: %s\n\n" "$avgclean" >&2 # output avg of clean array
unset tmp # reset variables for next time
unset clean
unset avgclean
tmp+=( $data ) # read first value for next time set
fi
ptime="$time" # save previous time for comparison
done <"$dfn"
## process final time block
echo " time: $ptime data : '${tmp[@]}'" >&2
## process data
clean=( $(rmoutlier ${tmp[@]} ) )
echo " time: $ptime clean: '${clean[@]}'" >&2
avgclean=$( average ${clean[@]} )
printf " avgclean: %s\n\n" "$avgclean" >&2
unset tmp
unset clean
unset avgclean
exit 0
Usage:
./outlier.sh datafile
Output:
$ ./outlier.sh dat/outlier.dat
time: 0.01667 data : '20.53 6.35 6.94 7.07 8.06 8.10 8.25 8.71 9.31'
->outlier: 20.53
time: 0.01667 clean: '6.35 6.94 7.07 8.06 8.10 8.25 8.71 9.31'
avgclean: 7.848750
time: 0.02500 data : '20.19 6.35 6.92 7.07 8.08 8.09 8.24 8.70 9.26'
->outlier: 20.19
time: 0.02500 clean: '6.35 6.92 7.07 8.08 8.09 8.24 8.70 9.26'
avgclean: 7.838750
time: 0.03333 data : '19.89 6.33 6.90 7.07 8.07 8.09 8.22 8.70 9.22'
->outlier: 19.89
time: 0.03333 clean: '6.33 6.90 7.07 8.07 8.09 8.22 8.70 9.22'
avgclean: 7.825000
time: 0.04167 data : '19.65 6.34 6.87 7.07 8.03 8.08 8.19 8.69 9.19'
->outlier: 19.65
time: 0.04167 clean: '6.34 6.87 7.07 8.03 8.08 8.19 8.69 9.19'
avgclean: 7.807500
time: 0.05000 data : '19.40 6.32 6.85 7.06 8.02 8.09 8.16 8.71 9.15'
->outlier: 19.40
time: 0.05000 clean: '6.32 6.85 7.06 8.02 8.09 8.16 8.71 9.15'
avgclean: 7.795000
time: 0.05833 data : '19.12 6.29 6.84 7.04 8.01 8.11 8.16 8.71 9.11'
->outlier: 19.12
time: 0.05833 clean: '6.29 6.84 7.04 8.01 8.11 8.16 8.71 9.11'
avgclean: 7.783750
time: 0.06667 data : '18.84 6.29 6.82 7.05 7.98 8.11 8.14 8.71 9.06'
->outlier: 18.84
time: 0.06667 clean: '6.29 6.82 7.05 7.98 8.11 8.14 8.71 9.06'
avgclean: 7.770000
time: 0.07500 data : '18.57 6.29 6.80 7.06 7.97 8.10 8.13 8.71 9.02'
->outlier: 18.57
time: 0.07500 clean: '6.29 6.80 7.06 7.97 8.10 8.13 8.71 9.02'
avgclean: 7.760000
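The script uses a fixed distance of 4 from the average, and the answer mentions that a standard deviation could be used instead. A minimal sketch of that idea, written in awk rather than bash and bc, is below. It is an assumption-laden illustration: the function name avg_within_sd, the default multiplier k=2, and the use of the population standard deviation are all choices made here, not taken from the original script.

```shell
# Sketch: average column 2 per time (column 1), keeping only values within
# k standard deviations of that time's raw mean (k defaults to 2).
avg_within_sd() {
    awk -v k="${2:-2}" '
    # First reading: per-time sum, sum of squares, and count.
    NR == FNR { sum[$1] += $2; sumsq[$1] += $2 * $2; cnt[$1]++; next }
    # Second reading: compare each value against mean +/- k * sd.
    {
        mean = sum[$1] / cnt[$1]
        var  = sumsq[$1] / cnt[$1] - mean * mean    # population variance
        sd   = (var > 0 ? sqrt(var) : 0)            # guard against rounding below 0
        d    = $2 - mean
        if (d < 0) d = -d                           # absolute deviation
        if (d <= k * sd) { csum[$1] += $2; ccnt[$1]++ }
    }
    END {
        for (t in ccnt)
            printf "%s %.5f\n", t, csum[t] / ccnt[t]
    }' "$1" "$1" | sort -n
}
```

A call such as avg_within_sd dat/outlier.dat 2 prints one "time average" line per time. Because a single large outlier inflates the standard deviation itself, the multiplier may need tuning for data like this.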
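Addendum: writing time & avg to a file
Below is the updated portion of the script, which writes the time and the clean average to a file (default: dat/outlier.out). You can pass any output filename you like as the second argument to the script, so the new usage is outlier.sh input_file output_file. Only the lines involving the output filename variable ofn have changed: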
## set variables
dfn="${1:-dat/outlier.dat}" # datafile (default dat/outlier.dat)
ofn="${2:-dat/outlier.out}" # output file (default dat/outlier.out)
declare -a tmp # temporary array holding data for given time
ptime=0 # variable holding previous time (flag for 1st line)
:> "$ofn" # truncate output file
## validate input filename
test -r "$dfn" || usage "Error: invalid input. File '$dfn' not found" 1
while read -r time data || [ -n "$data" ]; do # read all lines of data
if [ "$ptime" = 0 ] || [ "$ptime" = "$time" ]; then # if no change in time
tmp+=( $data ) # fill array with data
else
echo " time: $ptime data : '${tmp[@]}'" >&2 # output array to stderr
printf " time: %s " "$ptime" >>"$ofn" # output array to file
## process data
clean=( $(rmoutlier ${tmp[@]} ) ) # remove outlier
echo "time: $ptime clean: '${clean[@]}'" >&2 # output clean array
avgclean=$( average ${clean[@]} ) # average clean array
printf " avgclean: %s\n\n" "$avgclean" >&2 # output avg of clean array
printf " avgclean: %s\n" "$avgclean" >>"$ofn" # output avg of clean array to file
unset tmp # reset variables for next time
unset clean
unset avgclean
tmp+=( $data ) # read first value for next time set
fi
ptime="$time" # save previous time for comparison
done <"$dfn"
outlier.out:
time: 0.01667 avgclean: 7.848750
time: 0.02500 avgclean: 7.838750
time: 0.03333 avgclean: 7.825000
time: 0.04167 avgclean: 7.807500
time: 0.05000 avgclean: 7.795000
time: 0.05833 avgclean: 7.783750
time: 0.06667 avgclean: 7.770000
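The listing above ends at the loop's done, and the sample outlier.out stops at 0.06667, which suggests that the code handling the final time block also needs to write to the file. One way to avoid repeating the file-writing lines is a small helper called both inside the loop and once after it. The sketch below is hypothetical: process_block is a name invented here, and it assumes the script's own rmoutlier and average functions are already defined.

```shell
# Hypothetical helper: clean one time block and append its average to a file.
# Relies on the script's own rmoutlier and average functions being defined.
# The body runs in a subshell ( ... ), so its variables stay private.
process_block() (
    t=$1
    out=$2
    shift 2
    kept=$(rmoutlier "$@")                  # values that survive outlier removal
    printf " time: %s " "$t" >>"$out"
    # $kept is deliberately unquoted so each value becomes its own argument
    printf " avgclean: %s\n" "$(average $kept)" >>"$out"
)
```

After done <"$dfn", the last block could then be handled with process_block "$ptime" "$ofn" "${tmp[@]}".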