Group by three columns and build a table (preferably with awk)

Time: 2016-12-16 13:17:39

Tags: python awk count grouping

I need help grouping and counting across several columns.

INPUT: a TSV file.

It is sorted by columns 1, 2 and 4.

Header: string, start, stop, length, value

chr1    56971   57065   94      0.287234
chr1    565460  565601  141     0.411348
chr1    754342  754488  146     0.520548
chr1    783856  784002  146     0.315068
chr1    789652  789768  116     0.310345
chr1    790532  790628  96      0.520833
chr2    1744623 1744774 151     0.509934
chr2    1744623 1744774 151     0.509934
chr2    1744623 1744774 151     0.509934
chr2    1748501 1748635 134     0.440299
chr2    1748501 1748636 135     0.444444

Output:

                    0-10 length ... 90-100 ............140-150... 190-200
chr1:0-60000         A1(0), B1(0)..A2(1),B2(0.287234)..   A,B ... An,Bn
chr1:60000-120000          .             .                 .         . 
.                          .             .                 .         .
.                          .             .                 .         .
chr1:780000-840000       0,0     ..... 1,0.520833 ......1,0.315068..A,B
chr2:0-60000            A1,B1    .....   .        ......   .      .. .

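A = the number of rows that fall in the interval 0-60000 (based on columns 2 and 3 of the input)

B = the sum of column 5 of the input divided by A (the row count)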
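To make the definitions of A and B concrete, here is a minimal Python sketch that computes one table cell from three of the sample rows. The function name `cell` and the rule that a row belongs to the region containing its start coordinate are assumptions for illustration; the question does not say whether start or stop decides the region.

```python
# Three sample rows from the question: (chrom, start, stop, length, value)
ROWS = [
    ("chr1", 783856, 784002, 146, 0.315068),
    ("chr1", 789652, 789768, 116, 0.310345),
    ("chr1", 790532, 790628, 96, 0.520833),
]


def cell(rows, chrom, region, length_range):
    """Return (A, B) for one cell of the output table.

    Assumption: a row belongs to a region when its start lies in [lo, hi),
    and to a length bin when its length lies in [lo, hi).
    """
    r_lo, r_hi = region
    l_lo, l_hi = length_range
    values = [
        value
        for c, start, _stop, length, value in rows
        if c == chrom and r_lo <= start < r_hi and l_lo <= length < l_hi
    ]
    a = len(values)                    # A: number of matching rows
    b = sum(values) / a if a else 0.0  # B: mean of column 5, 0 for an empty cell
    return a, b


print(cell(ROWS, "chr1", (780000, 840000), (140, 150)))  # (1, 0.315068)
print(cell(ROWS, "chr1", (780000, 840000), (90, 100)))   # (1, 0.520833)
print(cell(ROWS, "chr1", (780000, 840000), (0, 10)))     # (0, 0.0)
```

These values match the `chr1:780000-840000` line of the expected output.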

First group by the first column, then create the regions with

for i in {0..249480000..60000}

and, for each region, count the rows grouped by length (0..200..10).
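The region and the length bin of a row can also be computed directly with integer division, rather than by looping over every possible region. A small Python sketch, where `bin_bounds`, `REGION_STEP` and `LENGTH_STEP` are names chosen here for illustration:

```python
REGION_STEP = 60000  # width of a genomic region, from the question
LENGTH_STEP = 10     # width of a length bin, from the question


def bin_bounds(value, step):
    """Return the half-open bin [lo, lo + step) that contains value."""
    lo = (value // step) * step
    return lo, lo + step


print(bin_bounds(783856, REGION_STEP))  # (780000, 840000)
print(bin_bounds(146, LENGTH_STEP))     # (140, 150)
```

With this, each input row is read once and assigned to exactly one cell of the table.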

I tried:

for z in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y
do
for i in {0..249480000..60000}
    do
u=$i
let "u +=60000"

"And now I don't know what to do next."

I know how to group by a single column:

awk -F, 'NR>1{arr[$1]++}END{for (a in arr) print a, arr[a]}'

But this is really hard for me. Can you help me?

1 answer:

Answer 0 (score: 1)

 awk -v Separator=' | ' '
    BEGIN{ LenStepSize = 10 ;  IntStepSize = 60000 }
    {
    # Store the labels
    Labels[ $1]++

    # Adapt the Step array size
    if ( IntLastIndex * IntStepSize < $3) IntLastIndex = int( $3 / IntStepSize) + 1
    IntIdx = int( $3 / IntStepSize)

    # Adapt the Length array size
    if( LenLastIndex * LenStepSize < $4) LenLastIndex = int( $4 / LenStepSize) + 1
    LenIdx = int( $4 / LenStepSize)

    # Create the mono "multi" index reference
    Idx = $1 "-" IntIdx "-" LenIdx

    # store the data element
    As[ Idx]++
    Bs[ Idx] += $5
    #printf( "DEBUG: As[%s]: %s | Bs[%s]:%s (+%s)\n", Idx, As[ Idx], Idx, Bs[ Idx], $5)
    }

    END {
       # Print the header
       printf( "Object               ")
       for ( Leng = 0; Leng <= LenLastIndex; Leng++ ) printf( "%s%3d - %3d", Separator, Leng * LenStepSize, (Leng + 1) * LenStepSize)
       printf( "\n                     ")
       for ( Leng = 0; Leng <= LenLastIndex; Leng++ ) printf( "%s  length ", Separator)

       # print each element (empty or with value)
       # - lines per label
       for ( Label in Labels) {
          # - one line per interval sub-section
          for ( Inter = 0; Inter <= IntLastIndex; Inter++ ) {
             printf( "\n%5s %7d-%7d", Label, Inter * IntStepSize, (Inter + 1) * IntStepSize - 1)

             # column per length section
             for ( Leng = 0; Leng <= LenLastIndex; Leng++ ) {
                Idx = Label "-" Inter "-" Leng
                printf( "%s%d , " ( Bs[ Idx] > 0 ? "%2.3f" : "%-5d") , Separator, As[ Idx], Bs[ Idx] / (As[ Idx] > 0 ? As[ Idx] : 1))
                }
             }
             print ""
          }
       }
    ' tsv.file
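  • Uses a simulated multidimensional array (a single index, but built from 3 elements)
  • Sized dynamically from the data (avoids creating a huge, almost empty array in memory)
  • Not suitable for large data files (because of the memory impact)
  • The output format is basic (no column or row sizing based on content, ...)
  • A Separator variable is added at the start of the awk command to make the columns visible (in this example), but you can set any pattern as the separator (a space, ",", ...) to suit your actual needs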
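Since the question is also tagged python, the same composite-key idea can be sketched there with a dictionary keyed by a tuple. This is a minimal illustration, not a port of the awk script: the names `aggregate`, `REGION_STEP` and `LENGTH_STEP` are made up here, the input is assumed to have no header line, and the region is taken from the stop column (column 3), as the awk code does with `$3`. Only cells that actually contain rows are stored and printed.

```python
from collections import defaultdict

REGION_STEP = 60000  # region width, as in the question
LENGTH_STEP = 10     # length bin width, as in the question


def aggregate(lines):
    """Map (label, region index, length index) to (A, B).

    A is the row count and B the mean of column 5 for that cell.
    Assumption: whitespace-separated fields and no header line.
    """
    counts = defaultdict(int)
    sums = defaultdict(float)
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or short lines
        label, _start, stop, length, value = fields[:5]
        key = (label, int(stop) // REGION_STEP, int(length) // LENGTH_STEP)
        counts[key] += 1
        sums[key] += float(value)
    return {key: (counts[key], sums[key] / counts[key]) for key in counts}


SAMPLE = """\
chr1\t56971\t57065\t94\t0.287234
chr2\t1744623\t1744774\t151\t0.509934
chr2\t1744623\t1744774\t151\t0.509934
"""

table = aggregate(SAMPLE.splitlines())
for (label, region, length_bin), (a, b) in sorted(table.items()):
    print(
        f"{label}:{region * REGION_STEP}-{(region + 1) * REGION_STEP} "
        f"length {length_bin * LENGTH_STEP}-{(length_bin + 1) * LENGTH_STEP}: "
        f"A={a}, B={b:.6f}"
    )
```

To process a real file, pass an open file object instead of `SAMPLE.splitlines()`, for example `aggregate(open("tsv.file"))`.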