split string and average by matching lines in awk

Time: 2016-04-04 18:12:26

Tags: awk

I am trying to output, for each matching $4 value, the count of matching lines, the text in $5 before the -, and the average of the matching $7 values. The output is sorted so that the matching $5 strings are grouped together. The awk below is close, but the output is empty, and there is probably a better way; hopefully it is a start :). Thank you :).

input

chr1    955543  955763  chr1:955543-955763  AGRN-6|gc=75    1   25
chr1    955543  955763  chr1:955543-955763  AGRN-6|gc=75    2   25
chr1    1167809 1168658 chr1:1167809-1168658    B3GALT6-42|gc=75.8  1   2
chr1    1167809 1168658 chr1:1167809-1168658    B3GALT6-42|gc=75.8  2   2
chr1    1167809 1168658 chr1:1167809-1168658    B3GALT6-42|gc=75.8  3   2
chr1    976035  976270  chr1:976035-976270  AGRN-9|gc=74.5  228 28
chr1    976035  976270  chr1:976035-976270  AGRN-9|gc=74.5  229 28
chr1    976035  976270  chr1:976035-976270  AGRN-9|gc=74.5  230 27

desired output ($4, count of matching $4 lines, text of $5 before the -, average of $7; sorted by $5)

chr1:955543-955763  2 AGRN  25  
chr1:976035-976270  3 AGRN  27
chr1:1167809-1168658 3 B3GALT6  2

awk

awk '
function file_print() {
    for (k in a) {
        split(k, ks, / |(-[0-9]*[|])/)
        printf("%s %d %s %d\n", ks[1], c[k], ks[2], a[k] / c[k]) > ofn
        delete a[k]
        delete c[k]
    }
    close(ofn)
}
NR > 1 && FNR == 1 {
    file_print()
}
FNR == 1 {
    ofn = substr(FILENAME, 1, length(FILENAME))
}
{
    a[k = $4 " " $5] += $7
    c[k]++
}
END {
    file_print()
}' input

2 answers:

Answer 0: (score: 1)

awk to the rescue!

$ awk '{split($5,f,"-"); k=$4 OFS f[1]; s[k]+=$NF; c[k]++}
    END{for(k in s) print k, c[k], int(s[k]/c[k])}' file

chr1:955543-955763 AGRN 2 25
chr1:976035-976270 AGRN 3 27
chr1:1167809-1168658 B3GALT6 3 2

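Note that the column order is slightly different, since the $5 prefix is part of the key as well. Also, the average is rounded down (truncated by int), as in your example. If you need to rearrange the columns, just pipe to ... | awk '{t=$2;$2=$3;$3=t}1' to swap the two fields. The row order produced by for (k in s) is not guaranteed, so pipe through sort if the order matters.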
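As a minimal sketch of the same idea, the variant below prints the columns in the asker's desired order directly, so the second awk for swapping fields is not needed. The file name sample_input.txt and the function name summarize are hypothetical, introduced only for illustration; the sort keys are one possible choice for grouping by gene name.

```shell
# Sample data from the question, written to a hypothetical file name.
cat > sample_input.txt <<'EOF'
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 1 25
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 2 25
chr1 1167809 1168658 chr1:1167809-1168658 B3GALT6-42|gc=75.8 1 2
chr1 1167809 1168658 chr1:1167809-1168658 B3GALT6-42|gc=75.8 2 2
chr1 1167809 1168658 chr1:1167809-1168658 B3GALT6-42|gc=75.8 3 2
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 228 28
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 229 28
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 230 27
EOF

# summarize (hypothetical helper): count and truncated average of the
# last field, keyed on $4 plus the part of $5 before the first "-".
summarize() {
  awk '{
         split($5, f, "-")            # f[1] is the $5 text before the first "-"
         k = $4 OFS f[1]
         s[k] += $NF                  # running sum of the last field
         c[k]++                       # number of lines for this key
       }
       END {
         for (k in s) {
           split(k, p, OFS)           # p[1] = region, p[2] = gene name
           print p[1], c[k], p[2], int(s[k] / c[k])
         }
       }' "$1"
}

# Group by gene name (field 3), then by region (field 1) within a gene.
summarize sample_input.txt | LC_ALL=C sort -k3,3 -k1,1
```

LC_ALL=C is set only to make the sort order byte-wise and reproducible across locales.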

Answer 1: (score: 1)

I think you are overcomplicating the task.

If I understand your requirement, this produces the output (in slightly different order):

awk '{seen[$4]++; sub(/-.*/, "", $5); field[$4]=$5; sum[$4]+=$7} 
      END{for (e in seen) print e, seen[e], field[e], int(sum[e]/seen[e])}' file

chr1:1167809-1168658 3 B3GALT6 2
chr1:976035-976270 3 AGRN 27
chr1:955543-955763 2 AGRN 25

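You can then pipe it through sort to group the rows. Note that -k 2 sorts on everything from the second field (the count) to the end of the line, which happens to give the desired order for this data; to sort strictly by the $5 name, use sort -k3,3 instead: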

awk '{seen[$4]++; sub(/-.*/, "", $5); field[$4]=$5; sum[$4]+=$7} 
      END{for (e in seen) print e, seen[e], field[e], int(sum[e]/seen[e])}' file | sort -k 2

chr1:955543-955763 2 AGRN 25
chr1:976035-976270 3 AGRN 27
chr1:1167809-1168658 3 B3GALT6 2
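To illustrate how the choice of sort key affects grouping, the sketch below uses three hypothetical summary rows (they are not from the question's data) in which the counts and the gene names disagree about the order. The variable names rows, by_count and by_gene are assumptions made for this example.

```shell
# Three hypothetical summary rows: region, count, gene, average.
rows='chr1:100-200 2 AGRN 10
chr1:300-400 3 B3GALT6 4
chr1:500-600 5 AGRN 7'

# Key runs from field 2 to the end of the line, so the count dominates
# and the two AGRN rows end up separated by the B3GALT6 row.
by_count=$(printf '%s\n' "$rows" | LC_ALL=C sort -k 2)

# Key is restricted to field 3, so rows are grouped by gene name.
by_gene=$(printf '%s\n' "$rows" | LC_ALL=C sort -k3,3)

printf 'sort -k 2:\n%s\n\nsort -k3,3:\n%s\n' "$by_count" "$by_gene"
```

With these rows, sort -k 2 leaves the B3GALT6 row between the two AGRN rows, while sort -k3,3 keeps the AGRN rows together.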