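I want to print the lines that match a search pattern and then compute an average over those lines. An example explains it best:
Input file: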
chr17 41275978 41276294 BRCA1_ex02_01 278
chr17 41275978 41276294 BRCA1_ex02_01 279
chr17 41275978 41276294 BRCA1_ex02_01 280
chr17 41275978 41276294 BRCA1_ex02_02 281
chr17 41275978 41276294 BRCA1_ex02_02 282
chr17 41275978 41276294 BRCA1_ex02_03 283
chr17 41275978 41276294 BRCA1_ex02_03 284
chr17 41275978 41276294 BRCA1_ex02_03 285
chr17 41275978 41276294 BRCA1_ex02_04 286
chr17 41275978 41276294 BRCA1_ex02_04 287
chr17 41275978 41276294 BRCA1_ex02_04 288
In a bash loop, I want to extract (for example) only the lines that share the same 4th column:
OUTPUT1:
chr17 41275978 41276294 BRCA1_ex02_01 278
chr17 41275978 41276294 BRCA1_ex02_01 279
chr17 41275978 41276294 BRCA1_ex02_01 280
OUTPUT2:
chr17 41275978 41276294 BRCA1_ex02_02 281
chr17 41275978 41276294 BRCA1_ex02_02 282
OUTPUT3:
chr17 41275978 41276294 BRCA1_ex02_03 283
chr17 41275978 41276294 BRCA1_ex02_03 284
chr17 41275978 41276294 BRCA1_ex02_03 285
and so on... After that, computing the average of the 5th column is very easy:
awk '{sum += $5} END {print sum / NR}' in_file.txt
In my case there are thousands of BRCA1_exXX_XX lines, so any ideas how to split the file?
Paul.
Answer 0 (score: 2)
I think this will do what you want.
awk '{
    # Keep a running sum of the fifth column, keyed by the value of the fourth column.
    v[$4] += $5
    # Keep a count of the lines that share each fourth-column value.
    n[$4]++
}
END {
    # Loop over all the fourth-column values we saw and print each one with the average of its fifth column.
    # Note: the iteration order of "for (val in n)" is unspecified in awk.
    for (val in n) {
        print val ": " v[val] / n[val]
    }
}' "$file"
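If you also want the separate per-group files shown in the question (OUTPUT1, OUTPUT2, ...), a minimal sketch along the same lines could look like the following. The sample file `in_file.txt` is rebuilt here from the first rows of the question's data, and the naming scheme (4th column plus `.txt`) is only an assumption for illustration.

```shell
# Hypothetical sample input, copied from the first rows of the question's data.
cat > in_file.txt <<'EOF'
chr17 41275978 41276294 BRCA1_ex02_01 278
chr17 41275978 41276294 BRCA1_ex02_01 279
chr17 41275978 41276294 BRCA1_ex02_01 280
chr17 41275978 41276294 BRCA1_ex02_02 281
chr17 41275978 41276294 BRCA1_ex02_02 282
EOF

# Append every line to a file named after its 4th column
# (the "<column 4>.txt" naming is an assumption, not from the question).
awk '{
    out = $4 ".txt"
    print >> out
    # Close after each write so thousands of groups do not exhaust open file handles.
    close(out)
}' in_file.txt
```

Because the sketch appends with `>>`, files left over from an earlier run should be removed first. Closing after every line is slow but safe; if the input is sorted by the 4th column, closing only when the key changes would presumably be faster.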
Answer 1 (score: 1)
Assuming the entries are sorted by the 4th column, as in the given data, you can do this:
awk '
$4 != prev {                      # if this line's 4th column differs from the previous line's
    if (cnt > 0)                  # if at least one line was counted for the previous group
        print prev, sum / cnt     # print the average of the previous group
    prev = $4                     # remember the current 4th column
    sum = $5                      # initialize sum to column 5
    cnt = 1                       # initialize count to 1
    next                          # go to next line
}
{
    sum += $5                     # accumulate total of 5th column
    ++cnt                         # increment count of lines
}
END {
    if (cnt > 0)                  # if count > 0 (avoid divide by 0 on empty file)
        print prev, sum / cnt     # print the average for the last group
}
' file
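If the input is not already sorted by the 4th column, one option is to sort it first and pipe the result into the same program. The sketch below assumes whitespace-separated fields and uses a made-up file name, `unsorted.txt`, holding shuffled rows from the question's data.

```shell
# Hypothetical unsorted input: rows from the question's data in shuffled order.
cat > unsorted.txt <<'EOF'
chr17 41275978 41276294 BRCA1_ex02_02 281
chr17 41275978 41276294 BRCA1_ex02_01 278
chr17 41275978 41276294 BRCA1_ex02_02 282
chr17 41275978 41276294 BRCA1_ex02_01 279
chr17 41275978 41276294 BRCA1_ex02_01 280
EOF

# Sort on the 4th field only, then average column 5 per run of equal keys.
averages=$(sort -k4,4 unsorted.txt | awk '
$4 != prev {                              # a new group starts here
    if (cnt > 0) print prev, sum / cnt    # flush the previous group
    prev = $4; sum = $5; cnt = 1
    next
}
{ sum += $5; ++cnt }                      # same group: accumulate
END { if (cnt > 0) print prev, sum / cnt }
')
printf '%s\n' "$averages"
```

With this sample the output should be `BRCA1_ex02_01 279` followed by `BRCA1_ex02_02 281.5`. Unlike the array-based answer, this keeps only one group's running totals in memory at a time, at the cost of the extra sort.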