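I need help grouping rows and computing statistics over several columns.
INPUT: a TSV file.
It is sorted by columns 1, 2 and 4.
Header: string, start, stop, length, value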
chr1 56971 57065 94 0.287234
chr1 565460 565601 141 0.411348
chr1 754342 754488 146 0.520548
chr1 783856 784002 146 0.315068
chr1 789652 789768 116 0.310345
chr1 790532 790628 96 0.520833
chr2 1744623 1744774 151 0.509934
chr2 1744623 1744774 151 0.509934
chr2 1744623 1744774 151 0.509934
chr2 1748501 1748635 134 0.440299
chr2 1748501 1748636 135 0.444444
OUTPUT:
0-10 length ... 90-100 ............140-150... 190-200
chr1:0-60000 A1(0), B1(0)..A2(1),B2(0.287234).. A,B ... An,Bn
chr1:60000-120000 . . . .
. . . . .
. . . . .
chr1:780000-840000 0,0 ..... 1,0.520833 ......1,0.315068..A,B
chr2:0-60000 A1,B1 ..... . ...... . .. .
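A = the number of rows in the interval 0-60000 (columns 2 to 3 of the input)
B = the sum of column 5 of the input divided by A (the number of rows)
First group by the first column, then
create regions with for i in {0..249480000..60000}
and, for each region, count the rows grouped by length (0..200..10).
I tried: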
for z in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y
do
for i in {0..249480000..60000}
do
u=$i
let "u +=60000"
"Now I don't know what the next step is."
I know how to group by one column:
awk -F, 'NR>1{arr[$1]++}END{for (a in arr) print a, arr[a]}'
But this is really hard for me. Can you help me?
Answer 0 (score: 1)
awk -v Separator=' | ' '
BEGIN{ LenStepSize = 10 ; IntStepSize = 60000 }
{
# Store the labels
Labels[ $1]++
# Adapt the Step array size
if ( IntLastIndex * IntStepSize < $3) IntLastIndex = int( $3 / IntStepSize) + 1
IntIdx = int( $3 / IntStepSize)
# Adapt the Length array size
if( LenLastIndex * LenStepSize < $4) LenLastIndex = int( $4 / LenStepSize) + 1
LenIdx = int( $4 / LenStepSize)
# Create the mono "multi" index reference
Idx = $1 "-" IntIdx "-" LenIdx
# store the data element
As[ Idx]++
Bs[ Idx] += $5
#printf( "DEBUG: As[%s]: %s | Bs[%s]:%s (+%s)\n", Idx, As[ Idx], Idx, Bs[ Idx], $5)
}
END {
# Print the header
printf( "Object ")
for ( Leng = 0; Leng <= LenLastIndex; Leng++ ) printf( "%s%3d - %3d", Separator, Leng * LenStepSize, (Leng + 1) * LenStepSize)
printf( "\n ")
for ( Leng = 0; Leng <= LenLastIndex; Leng++ ) printf( "%s length ", Separator)
# print each element (empty or with value)
# - lines per label
for ( Label in Labels) {
# - per sub section of interval
for ( Inter = 0; Inter <= IntLastIndex; Inter++ ) {
printf( "\n%5s %7d-%7d", Label, Inter * IntStepSize, (Inter + 1) * IntStepSize - 1)
# column per length section
for ( Leng = 0; Leng <= LenLastIndex; Leng++ ) {
Idx = Label "-" Inter "-" Leng
printf( "%s%d , " ( Bs[ Idx] > 0 ? "%2.3f" : "%-5d") , Separator, As[ Idx], Bs[ Idx] / (As[ Idx] > 0 ? As[ Idx] : 1))
}
}
print ""
}
}
' tsv.file
The Separator variable is set to ' | ' here so that the columns are easy to see, but you can set it to any string (a space, ",", ...) to suit your actual needs.
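If the wide table is hard to check by eye, the same binning can be sketched in a long format, with one line per non-empty cell. This is only a minimal sketch under stated assumptions, not a replacement for the script above: the function name bin_rows is hypothetical, the input is assumed to be whitespace-separated with no header line, and rows are assigned to a region by their start column ($2), whereas the script above uses the stop column ($3).

```shell
# Hypothetical helper: one output line per (label, region, length bin) cell.
# Assumptions: whitespace-separated input, no header line,
# region chosen from the start column ($2).
bin_rows() {
  awk -v region=60000 -v step=10 '
    {
      r = int($2 / region) * region   # lower bound of the 60000-wide region
      l = int($4 / step) * step       # lower bound of the 10-wide length bin
      key = $1 ":" r "-" (r + region) "\t" l "-" (l + step)
      count[key]++                    # A: number of rows in this cell
      sum[key] += $5                  # running sum of column 5
    }
    END {
      # B is the sum of column 5 divided by A
      for (key in count)
        printf "%s\t%d\t%.6f\n", key, count[key], sum[key] / count[key]
    }
  ' "$@" | sort
}
```

It can be called as bin_rows tsv.file, or fed from a pipe. If the file does start with a header line, adding an NR > 1 condition in front of the first block would skip it. Empty cells are simply not printed, which keeps the output short when most of the 0..249480000 range has no rows.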