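I'm not very proficient with awk, but I found Fredrik Pihl's answer to this question very helpful; it shows how to compute the average of one field ($3) over the many records that share another field ($1):
Question: awk average part of a column if lines (specific field) match
Input sample: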
$cat NDVI-bm
P01 031.RAW 0.516 0 0
P01 021.RAW 0.449 0 0
P02 045.RAW 0.418 0 0
P03 062.RAW 0.570 0 0
P03 064.RAW 0.469 0 0
P04 083.RAW 0.636 0 0
P04 081.RAW 0.592 0 0
P04 082.RAW 0.605 0 0
P04 084.RAW 0.648 0 0
P05 093.RAW 0.748 0 0
Fredrik Pihl's answer:
{
    sum[$1]+=$3
    cnt[$1]++
}
END {
    print "Name" "\t" "sum" "\t" "cnt" "\t" "avg"
    for (i in sum)
        print i "\t" sum[i] "\t" cnt[i] "\t" sum[i]/cnt[i]
}
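However, I am also trying to compute the variance of the matching fields (the sum of the squared differences between each value and the mean, divided by the count). I think I need either a way to compute the mean for each group of matching records before the END block, or, if the whole variance calculation can be done inside the END block, some way to retrieve the original values of $3 there. I don't know how to do that. Thanks for any hints.
Answer 0: (score: 2)
Code for GNU awk: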
{
    sum[$1]+=$3
    count[$1]++
    groups[$3]=$1
}
END {
    for (i in sum) mean[i]=sum[i]/count[i]
    for (i in groups) meandiff[i]=i-mean[groups[i]]
    for (i in groups) sumdiff2[groups[i]]+=meandiff[i]^2
    for (i in sumdiff2) var[i]=sumdiff2[i]/count[i]
    for (i in var)
        print "group:", i, "count:", count[i], "\tmean:", mean[i], "\tsum:", sum[i], "\tsumdiff^2:", sumdiff2[i], "\t\tvariance:", var[i]
}
$ cat file
P01 031.RAW 0.516 0 0
P01 021.RAW 0.449 0 0
P02 045.RAW 0.418 0 0
P03 062.RAW 0.570 0 0
P03 064.RAW 0.469 0 0
P04 083.RAW 0.636 0 0
P04 081.RAW 0.592 0 0
P04 082.RAW 0.605 0 0
P04 084.RAW 0.648 0 0
P05 093.RAW 0.748 0 0
$ awk -f prog.awk file
group: P01 count: 2 mean: 0.4825 sum: 0.965 sumdiff^2: 0.0022445 variance: 0.00112225
group: P02 count: 1 mean: 0.418 sum: 0.418 sumdiff^2: 0 variance: 0
group: P03 count: 2 mean: 0.5195 sum: 1.039 sumdiff^2: 0.0051005 variance: 0.00255025
group: P04 count: 4 mean: 0.62025 sum: 2.481 sumdiff^2: 0.00204875 variance: 0.000512188
group: P05 count: 1 mean: 0.748 sum: 0.748 sumdiff^2: 0 variance: 0
Answer 1: (score: 0)
You can compute the variance at the end by accumulating both the sum and the sum of squares of the samples.
Then
variance = (Sum of squares - (Sum*Sum)/n)/n
So
{
    sum[$1]+=$3
    sum_squares[$1]+=$3*$3
    cnt[$1]++
}
END {
    print "Name" "\t" "sum" "\t" "cnt" "\t" "avg" "\t" "var"
    for (i in sum)
        print i "\t" sum[i] "\t" cnt[i] "\t" sum[i]/cnt[i] "\t" (sum_squares[i] - (sum[i]*sum[i])/cnt[i])/cnt[i]
}
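As a check on the formula above, it can be derived from the definition of variance given in the question (sum of squared differences from the mean, divided by the count). The symbols are introduced here for illustration only: x_k are the $3 values of one group, n is their count, and \bar{x} is their mean.

```latex
\begin{aligned}
\sigma^2 &= \frac{1}{n}\sum_{k=1}^{n}\left(x_k-\bar{x}\right)^2
          = \frac{1}{n}\left(\sum_{k=1}^{n} x_k^2 \;-\; 2\bar{x}\sum_{k=1}^{n} x_k \;+\; n\bar{x}^2\right) \\
         &= \frac{1}{n}\left(\sum_{k=1}^{n} x_k^2 \;-\; \frac{\left(\sum_{k=1}^{n} x_k\right)^2}{n}\right),
          \qquad \text{using } \bar{x} = \frac{1}{n}\sum_{k=1}^{n} x_k .
\end{aligned}
```

For group P01 in the sample input, the sum is 0.965 and the sum of squares is 0.516^2 + 0.449^2 = 0.467857, so the variance is (0.467857 - 0.965^2/2)/2 = 0.00112225, which agrees with the value printed in Answer 0. Note that subtracting two nearly equal quantities can lose precision when the values are large relative to their spread; for data like this sample it should not matter.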
To select a specific pattern, put it in front of the summation block (note that END matches the end of the file), e.g.
/P03/ {
    sum[$1]+=$3
    sum_squares[$1]+=$3*$3
    cnt[$1]++
}
Now only lines containing P03 are processed.
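Since /P03/ matches the text anywhere on the line, a line from another group whose file name happens to contain P03 would also be counted. A possible variant, sketched here as an assumption rather than as part of the answer, compares the first field exactly. The function name group_stats and the output layout (group, count, mean, variance) are made up for this example.

```shell
# group_stats GROUP [FILE...]: count, mean and population variance of column 3
# for the records whose first field is exactly GROUP (hypothetical helper).
group_stats() {
    grp=$1
    shift
    awk -v grp="$grp" '
        $1 == grp {                 # exact match on field 1, not a regex on the line
            sum += $3
            sum_squares += $3 * $3
            cnt++
        }
        END {
            if (cnt > 0)            # print nothing if the group never occurred
                printf "%s %d %.6g %.6g\n", grp, cnt, sum / cnt, (sum_squares - sum * sum / cnt) / cnt
        }
    ' "$@"
}
```

It could be called as `group_stats P03 NDVI-bm`; with no file argument it reads standard input.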
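Answer 2: (score: 0)
I think that in your awk script, groups[$3]=$1 will not work correctly if several raw data rows have the same value in column 3, because those N identical values are stored only once in groups.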
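One way around this, as a rough sketch rather than a definitive fix, is to store each record under its own index instead of under its value, so that repeated values in column 3 are all kept. The function name group_variance, the arrays key and val, and the output layout (group, count, mean, variance) are assumptions made for this example.

```shell
# group_variance [FILE...]: population variance of column 3 per group in column 1.
# Records are stored by position, so duplicate values are not collapsed.
group_variance() {
    awk '
        NF >= 3 {
            n++
            key[n] = $1             # group of the n-th usable record
            val[n] = $3             # its value, kept even if it repeats
            sum[$1] += $3
            count[$1]++
        }
        END {
            # second pass over the stored records, now that the means are known
            for (k = 1; k <= n; k++) {
                d = val[k] - sum[key[k]] / count[key[k]]
                sumdiff2[key[k]] += d * d
            }
            for (g in sum)
                printf "%s %d %.6g %.6g\n", g, count[g], sum[g] / count[g], sumdiff2[g] / count[g]
        }
    ' "$@" | sort                   # for (g in sum) has no defined order, so sort the lines
}
```

For example, `group_variance NDVI-bm` would print one line per group. The trade-off compared with the sum-of-squares approach in Answer 1 is that every value is held in memory until END.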