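awk: geometric mean over rows with the same values

Date: 2014-06-26 08:47:56

Tags: arrays bash awk mean

I have the following input, and I want to compute the geometric mean of "activity" over rows whose "Cpd_number" and "ID3" are the same. The files contain a lot of data, so we may need arrays to do the trick. However, as an awk beginner, I am not sure how to start. Could anyone offer some hints?

Input: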

“ID1”,“Cpd_number”, “ID2”,”ID3”,”activity”
“95”,“123”,”4”,”5”,”10”
“95”, “123”,”4”,”5”,”100”
“95”, “123”,”4”,”5”,”1”
“95”, “123”,”4”,”6”,”10”
“95”, “123”,”4”,”6”,”100”
“95”, “456”,”4”,”6”,”10”
“95”, “456”,”4”,”6”,”100”

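The three rows with “95”,“123”,“4”,“5” should be combined into one geometric mean.

The two rows with “95”,“123”,“4”,“6” should be combined into one geometric mean.

The two rows with “95”,“456”,“4”,“6” should be combined into one geometric mean.

This is the desired output: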

“ID1”,“Cpd_number”, “ID2”,”ID3”,”activity”
“95”,“123”,”4”,”5”,”10”
“95”, “123”,”4”,”6”,”31.62”
“95”, “456”,”4”,”6”,”31.62”

Some information about the geometric mean:

http://en.wikipedia.org/wiki/Geometric_mean

This script computes the geometric mean of the first column:

 #!/usr/bin/awk -f
 {
   b  = $1;   # value of 1st column
   C += log(b);  
   D++; 
 }

 END {
   print "Geometric mean  : ",exp(C/D);
   }
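
The script above relies on the identity between the n-th root of a product and the exponential of the mean logarithm, with C playing the role of the sum of logs and D the count n. As a worked check against the sample data (assuming every activity value is positive, since the logarithm is not defined otherwise):

```latex
\[
\Bigl(\prod_{i=1}^{n} x_i\Bigr)^{1/n}
  = \exp\Bigl(\frac{1}{n}\sum_{i=1}^{n}\ln x_i\Bigr)
  = \exp(C/D)
\]
\[
(10 \cdot 100 \cdot 1)^{1/3} = 1000^{1/3} = 10,
\qquad
(10 \cdot 100)^{1/2} = \sqrt{1000} \approx 31.62
\]
```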

2 answers:

Answer 0: (score: 1)

Given this file:

$ cat infile
"ID1","Cpd_number","ID2","ID3","activity"
"95","123","4","5","10"
"95","123","4","5","100"
"95","123","4","5","1"
"95","123","4","6","10"
"95","123","4","6","100"
"95","456","4","6","10"
"95","456","4","6","100"

this script:

awk -F\" 'BEGIN{print}            # Prints an empty line; the header is processed as a data row below
      last != $4""$8 && last{     # ONLY when the last key "Cpd_number + ID3"
          print line,exp(C/D)     # differs from the current one: print line + geometric mean
          C=D=0}                  # reset accumulators
      { # This block processes each line of infile
       C += log($(NF-1)+0)        # C: running sum of logs
       D++                        # D: counter
       $(NF-1)=""                 # Drop the activity column in order to print the line
       line=$0                    # line is the current line without activity
       last=$4""$8}               # Store the key in order to track key changes
      END{ # This block runs after the whole file has been read,
           # to print the last mean, which cannot be triggered
           # by the previous blocks
          print line,exp(C/D)}' infile

will output:

 ID1 , Cpd_number ,  ID2 , ID3 ,   0
 95 ,  123 , 4 , 5 ,   10
 95 ,  123 , 4 , 6 ,   31.6228
 95 ,  456 , 4 , 6 ,   31.6228

The formatting still needs some work.
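
Since the question asks about arrays, one possible way to handle both the grouping and the formatting is sketched below. This is not the method used in the answer above: it keys awk arrays on Cpd_number and ID3, so matching rows do not have to be adjacent, and it prints the mean with two decimals inside quotes. The function name `geomean_by_key`, the array names, and the temporary copy of infile are assumptions made for illustration. The sketch also assumes that fields are separated by exactly `","` with no surrounding spaces.

```shell
#!/bin/bash
# Sample data copied from the infile shown above, written to a temporary file.
infile=$(mktemp)
cat > "$infile" <<'EOF'
"ID1","Cpd_number","ID2","ID3","activity"
"95","123","4","5","10"
"95","123","4","5","100"
"95","123","4","5","1"
"95","123","4","6","10"
"95","123","4","6","100"
"95","456","4","6","10"
"95","456","4","6","100"
EOF

# Hypothetical helper: geometric mean of activity per (Cpd_number, ID3) pair.
geomean_by_key() {
    awk -F'","' '
    NR == 1 { print; next }                    # pass the header through unchanged
    {
        act = $5
        gsub(/"/, "", act)                     # drop the closing quote of the last field
        key = $2 SUBSEP $4                     # Cpd_number and ID3 form the group key
        if (!(key in cnt)) {                   # first row of a new group
            order[++n] = key                   # remember the input order of the groups
            prefix[key] = $0
            sub(/"[^"]*"$/, "", prefix[key])   # keep everything before the activity field
        }
        logsum[key] += log(act + 0)            # accumulate the sum of logs
        cnt[key]++                             # number of rows in the group
    }
    END {
        for (i = 1; i <= n; i++) {
            k = order[i]
            printf "%s\"%.2f\"\n", prefix[k], exp(logsum[k] / cnt[k])
        }
    }' "$1"
}

geomean_by_key "$infile"
```

With the sample data this prints the header followed by one row per group, with activity values "10.00", "31.62" and "31.62". Each output row reuses the ID1 and ID2 values of the first row seen for its group.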

NOTE: the straight quote character " is used instead of “ and ”

Edit: NF is the number of fields in the current record, so with " as the field separator, $(NF-1) is the last quoted value:

$ awk -F\" 'BEGIN{getline}{print $(NF-1)}' infile
10
100
1
10
100
10
100

So in log($(NF-1)+0) we apply the log function to that value (adding 0 forces it to be treated as a number).

D++ is just a counter.
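
The effect of adding 0 can be checked in isolation. The helper name `to_num` below is hypothetical and exists only for this illustration. A non-numeric string such as the header word activity converts to 0, which is presumably why the header row ends in 0 in the output above: the log of 0 is minus infinity, and exp of that is 0.

```shell
#!/bin/bash
# Hypothetical helper: show what awk gets when it adds 0 to a string.
to_num() {
    awk -v s="$1" 'BEGIN { print s + 0 }'
}

to_num 100        # numeric string: prints 100
to_num activity   # non-numeric string: prints 0
```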

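Answer 1: (score: 0)

Why use awk? Just do it in bash, using bc or calc to handle the floating-point math. You can download calc from http://www.isthe.com/chongo/src/calc/ (2.12.4.13-11 is the latest). RPMs, binaries, and source tarballs are available. In my opinion it is far superior to bc. The routine is fairly simple. You first need to strip all the " quotes from the data file, leaving a plain CSV file; that helps. See the sed command in the script comments below. Note that the geometric mean below is the n-th root of the product of the n activity values in each group. If you need a different mean, just adjust the following code: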

#!/bin/bash

##
##  You must strip all quotes from data before processing, or write more code to do
##  it here. Just run: sed 's/"//g' < datafile > newdatafile
##  Then use newdatafile as command line argument to this program
##
##  Additionally, this script uses 'calc' for floating point math. go download it
##  from: http://www.isthe.com/chongo/src/calc/ (2.12.4.13-11 is latest). You can also
##  use bc if you like, but why, calc is so much better.
##

## test to make sure file passed as argument is readable
test -r "$1" || { echo "error: invalid input, usage: ${0//*\//} filename"; exit 1; }

## function to strip extraneous whitespace from input
trimWS() {
    [[ -z $1 ]] && return 1
    strln="${#1}"
    [[ strln -lt 2 ]] && return 1
    trimSTR=$1
    trimSTR="${trimSTR#"${trimSTR%%[![:space:]]*}"}"  # remove leading whitespace characters
    trimSTR="${trimSTR%"${trimSTR##*[![:space:]]}"}"  # remove trailing whitespace characters
    echo $trimSTR
    return 0
}

let cnt=0
let oldsum=0    # holds value to compare against new Cpd_number & ID3
product=1       # initialize product to 1
pcnt=0          # initialize the number of values in product
IFS=$',\n'      # Internal Field Separator, set to break on ',' or newline

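# read returns non-zero at EOF; the extra test on $act (still set from the last
# row) runs the body one more time so that the final group gets printed
while read newid1 newcpd newid2 newid3 newact || test -n "$act"; do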

    cpd=`trimWS $cpd`  # trimWS from cpd (only one that needed it)

    # if first iteration, just output first row
    test "$cnt" -eq 0 && echo " $newid1 $newcpd $newid2 $newid3 $newact"

    # after first iteration, test oldsum -ne sum, if so do geometric mean
    # and reset product and counters
    if test "$cnt" -gt 0 ; then

        # calculate sum to test against oldsum (note: distinct pairs with equal
        # sums, e.g. 123+6 and 124+5, would be treated as the same group)
        sum=$((newcpd+newid3))
        if test "$oldsum" -ne "$sum" && test "$cnt" -gt 1; then
            # geometric mean (nth root of product)
            # mean=`calc -p "root ($product, $pcnt)"`  # using calc
            mean=`echo "scale=6; e( l($product) / $pcnt)" | bc -l` # using bc
            echo " $id1 $cpd $id2 $id3  average: $mean"
            pcnt=0
            product=1
        fi

        # update last values to new values
        oldsum=$sum
        id1="$newid1"
        cpd="$newcpd"
        id2="$newid2"
        id3="$newid3"
        act="$newact"

        ((product*=act))  # accumulate product
        ((pcnt+=1))
    fi

    ((cnt+=1))

done < "$1"

Output:

# output using calc
ID1 Cpd_number  ID2 ID3 activity
95 123 4 5  average: 10
95 123 4 6  average: 31.62277660168379331999
95 456 4 6  average: 31.62277660168379331999

# output using bc
ID1 Cpd_number  ID2 ID3 activity
95 123 4 5  average: 9.999999
95 123 4 6  average: 31.622756
95 456 4 6  average: 31.622756

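The updated script computes the correct mean. It is a bit more involved, because old and new values must be kept in order to detect changes in cpd & ID3. This is probably simpler in awk, but if you need more flexibility later, bash may be the answer.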
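
The quote-stripping step that the script expects can be sketched as follows. The names datafile and newdatafile follow the script's header comment; the temporary files and the two sample rows are assumptions used only for illustration.

```shell
#!/bin/bash
# Temporary stand-ins for the datafile and newdatafile named in the script comment.
datafile=$(mktemp)
newdatafile=$(mktemp)
printf '%s\n' '"ID1","Cpd_number","ID2","ID3","activity"' \
              '"95","123","4","5","10"' > "$datafile"

# Delete every double quote, leaving plain comma-separated values.
sed 's/"//g' < "$datafile" > "$newdatafile"
cat "$newdatafile"
```

The resulting file can then be passed to the script as its only argument.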