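I have the following input, and I want to take the geometric mean of "activity" over the rows where both "Cpd_number" and "ID3" are the same. The files contain a lot of data, so we may need arrays to do the trick. However, as an awk beginner, I am not sure how to start. Could anyone offer some hints?
Input: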
“ID1”,“Cpd_number”, “ID2”,”ID3”,”activity”
“95”,“123”,”4”,”5”,”10”
“95”, “123”,”4”,”5”,”100”
“95”, “123”,”4”,”5”,”1”
“95”, “123”,”4”,”6”,”10”
“95”, “123”,”4”,”6”,”100”
“95”, “456”,”4”,”6”,”10”
“95”, “456”,”4”,”6”,”100”
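The three rows with “95”,“123”,“4”,“5” should be combined into one geometric mean
The two rows with “95”,“123”,“4”,“6” should be combined into one geometric mean
The two rows with “95”,“456”,“4”,“6” should be combined into one geometric mean
This is the desired output: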
“ID1”,“Cpd_number”, “ID2”,”ID3”,”activity”
“95”,“123”,”4”,”5”,”10”
“95”, “123”,”4”,”6”,”31.62”
“95”, “456”,”4”,”6”,”31.62”
Some information about the geometric mean:
http://en.wikipedia.org/wiki/Geometric_mean
This script computes the geometric mean of the first column:
#!/usr/bin/awk -f
{
b = $1; # value of 1st column
C += log(b);
D++;
}
END {
print "Geometric mean : ",exp(C/D);
}
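For reference, the script above relies on the identity that the geometric mean of positive values equals the exponential of the arithmetic mean of their logarithms; C holds the sum of logs and D the count. A short worked check against the sample groups (values taken from the input above, rounded to two decimals):

```latex
\[
\left(\prod_{i=1}^{n} x_i\right)^{1/n}
  = \exp\!\left(\frac{1}{n}\sum_{i=1}^{n}\ln x_i\right),
  \qquad x_i > 0
\]
\[
(10 \cdot 100 \cdot 1)^{1/3} = 1000^{1/3} = 10,
\qquad
(10 \cdot 100)^{1/2} = \sqrt{1000} \approx 31.62
\]
```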
Answer 0 (score: 1)
Given this file:
$ cat infile
"ID1","Cpd_number","ID2","ID3","activity"
"95","123","4","5","10"
"95","123","4","5","100"
"95","123","4","5","1"
"95","123","4","6","10"
"95","123","4","6","100"
"95","456","4","6","10"
"95","456","4","6","100"
this script:
awk -F\" 'BEGIN{print} # Prints an empty line (no record read yet); the header row is handled as data below
last != $4""$8 && last{ # ONLY when the last key "Cpd_number + ID3"
print line,exp(C/D) # differs from the current one, print line + geometric mean
C=D=0} # reset accumulators
{ # This block processes each line of infile
C += log($(NF-1)+0) # C: running sum of logs
D++ # D: counter
$(NF-1)="" # Get rid of the activity column in order to print line
line=$0 # line is the current line without activity
last=$4""$8} # Store the key in order to track switching
END{ # This block triggers after the complete file has been read,
# to print the last mean, which cannot be triggered by
# the previous blocks
print line,exp(C/D)}' infile
will yield:
ID1 , Cpd_number , ID2 , ID3 , 0
95 , 123 , 4 , 5 , 10
95 , 123 , 4 , 6 , 31.6228
95 , 456 , 4 , 6 , 31.6228
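Some formatting work remains. In particular, the header row is treated as data: its non-numeric activity field evaluates to 0, log(0) is -inf and exp(-inf) is 0, which is why the header is printed with 0 in the last column. The script also assumes that rows with the same key are adjacent in the file.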
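Since the question mentions arrays, here is a minimal sketch of an array-based variant that also keeps the quoted CSV layout. It is an illustration rather than the script above: the function name geomean_by_group and the array names sum, n, order and prefix are hypothetical, it assumes every field is quoted with straight " characters and no spaces around the commas (so `","` can serve as the field separator), and the two-decimal %.2f format is an assumption about the desired output.

```shell
# Sample data from the question, written with straight quotes.
cat > infile <<'EOF'
"ID1","Cpd_number","ID2","ID3","activity"
"95","123","4","5","10"
"95","123","4","5","100"
"95","123","4","5","1"
"95","123","4","6","10"
"95","123","4","6","100"
"95","456","4","6","10"
"95","456","4","6","100"
EOF

# Hypothetical helper: geometric mean of activity per (Cpd_number, ID3) group.
geomean_by_group() {
    awk -F'","' '
    NR == 1 { print; next }            # pass the header row through unchanged
    {
        act = $5
        gsub(/"/, "", act)             # drop the closing quote left on the last field
        key = $2 SUBSEP $4             # group key: Cpd_number and ID3
        if (!(key in n)) {             # first row of a new group
            order[++groups] = key      # remember the order of first appearance
            prefix[key] = $1 FS $2 FS $3 FS $4   # first four columns of that row
        }
        sum[key] += log(act)           # running sum of logs
        n[key]++                       # number of values in the group
    }
    END {
        for (i = 1; i <= groups; i++) {
            k = order[i]
            printf "%s%s%.2f\"\n", prefix[k], FS, exp(sum[k] / n[k])
        }
    }' "$1"
}

geomean_by_group infile
```

Because the sums are kept per key, rows of the same group do not have to be adjacent. With the sample data this prints 10.00 for the first group rather than the 10 shown in the desired output; adjust the printf format if that matters.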
NOTE: char " is used instead of “ and ”
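Edit: NF is the number of fields in the current record. With -F\" the empty text after the final quote counts as the last field, so $(NF-1) is the last quoted value (activity):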
$ awk -F\" 'BEGIN{getline}{print $(NF-1)}' infile
10
100
1
10
100
10
100
So in log($(NF-1)+0) we apply the log function to that value (adding 0 forces it to be treated as a number).
D++ is just a counter.
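To make the field numbering concrete, here is a small check on a single sample row. It only uses standard awk features; the variable name fields is just for illustration.

```shell
# With " as the separator, the values land in the even-numbered fields:
# $2=ID1, $4=Cpd_number, $6=ID2, $8=ID3, $10=activity; $1 and $NF are empty.
fields=$(echo '"95","123","4","5","10"' |
    awk -F\" '{ print NF; print $4, $8, $(NF-1) }')
echo "$fields"
# prints:
# 11
# 123 5 10
```

This is why the key is built from $4 and $8 and the activity is read from $(NF-1).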
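Answer 1 (score: 0)
Why use awk? Just do it in bash, using bc or calc to handle the floating-point math. You can download calc at http://www.isthe.com/chongo/src/calc/ (2.12.4.13-11 is the latest). RPMs, binaries and source tarballs are available. In my opinion it is far superior to bc. The routine is fairly simple. You first need to strip the extraneous " quotes from the data file, leaving a plain csv file. That helps. See the sed command in the comments below. Note that the geometric mean below is the n-th root of the product of the activity values in each group. If you need a different mean, just adjust the code below: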
#!/bin/bash
##
## You must strip all quotes from the data before processing, or write more code to do
## it here. Just do "$ sed 's/"//g' < datafile > newdatafile", then use
## newdatafile as the command line argument to this program.
##
## Additionally, this script uses 'calc' for floating point math. Go download it
## from: http://www.isthe.com/chongo/src/calc/ (2.12.4.13-11 is latest). You can also
## use bc if you like, but why? calc is so much better.
##
## test to make sure file passed as argument is readable
test -r "$1" || { echo "error: invalid input, usage: ${0//*\//} filename"; exit 1; }
## function to strip extraneous whitespace from input
trimWS() {
[[ -z $1 ]] && return 1
strln="${#1}"
[[ strln -lt 2 ]] && return 1
trimSTR=$1
trimSTR="${trimSTR#"${trimSTR%%[![:space:]]*}"}" # remove leading whitespace characters
trimSTR="${trimSTR%"${trimSTR##*[![:space:]]}"}" # remove trailing whitespace characters
echo $trimSTR
return 0
}
let cnt=0
let oldsum=0 # holds value to compare against new Cpd_number & ID3
product=1 # initialize product to 1
pcnt=0 # initialize the number of values in product
IFS=$',\n' # Internal Field Separator, set to break on ',' or newline
while read newid1 newcpd newid2 newid3 newact || test -n "$act"; do
cpd=`trimWS $cpd` # trimWS from cpd (only one that needed it)
# if first iteration, just output first row
test "$cnt" -eq 0 && echo " $newid1 $newcpd $newid2 $newid3 $newact"
# after first iteration, test oldsum -ne sum, if so do geometric mean
# and reset product and counters
if test "$cnt" -gt 0 ; then
sum=$((newcpd+newid3)) # sum of Cpd_number and ID3, used as the group key to test against oldsum (distinct pairs with equal sums would collide)
if test "$oldsum" -ne "$sum" && test "$cnt" -gt 1; then
# geometric mean (nth root of product)
# mean=`calc -p "root ($product, $pcnt)"` # using calc
mean=`echo "scale=6; e( l($product) / $pcnt)" | bc -l` # using bc
echo " $id1 $cpd $id2 $id3 average: $mean"
pcnt=0
product=1
fi
# update last values to new values
oldsum=$sum
id1="$newid1"
cpd="$newcpd"
id2="$newid2"
id3="$newid3"
act="$newact"
((product*=act)) # accumulate product
((pcnt+=1))
fi
((cnt+=1))
done < "$1"
Output:
# output using calc
ID1 Cpd_number ID2 ID3 activity
95 123 4 5 average: 10
95 123 4 6 average: 31.62277660168379331999
95 456 4 6 average: 31.62277660168379331999
# output using bc
ID1 Cpd_number ID2 ID3 activity
95 123 4 5 average: 9.999999
95 123 4 6 average: 31.622756
95 456 4 6 average: 31.622756
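The updated script calculates the proper mean. It is a bit more involved, because the old and new values have to be kept in order to test for changes in cpd & ID3. awk may well be the simpler way to do this, but if you need more flexibility later on, bash may be the answer.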
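The quote-stripping step that the script's header comments ask for can be sketched as follows. The helper name strip_quotes and the file names datafile and newdatafile are placeholders, and the sketch assumes the file uses straight " characters only.

```shell
# Hypothetical helper: remove every double quote from a CSV file.
strip_quotes() {
    sed 's/"//g' "$1"
}

printf '%s\n' '"95","123","4","6","10"' > datafile
strip_quotes datafile > newdatafile
cat newdatafile
# prints: 95,123,4,6,10
```

The resulting newdatafile is what gets passed as the argument to the bash script.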