How can I merge specific rows using awk?

Time: 2019-02-15 10:31:53

Tags: awk concatenation

I would like to learn how to concatenate the values of several columns across rows on Linux. I have a dataset that looks like this:

gene    match_type  drug                 sources           pmids
ABO     Definite    CHEMBL50267          DrugBank          17139284|17016423
ABO     Definite    URIDINE_DIPHOSPHATE  TdgClinicalTrial  17139284|17016423
ABO     Definite    CHEMBL439009         DrugBank          12972418
ABO     Definite    CHEMBL1232343        DrugBank          NA
ABO     Definite    CHEMBL503075         DrugBank          NA

I am trying to merge these rows into a single row (concatenating the drug, sources and pmids columns), like this:

gene    match_type  drug                                                                         sources                                           pmids
ABO     Definite    CHEMBL1232343 CHEMBL439009 CHEMBL50267 CHEMBL503075 URIDINE_DIPHOSPHATE NA  DrugBank TdgClinicalTrial DrugBank DrugBank DrugBank    0 12972418 17139284|17016423  17139284|17016423 NA NA

I have looked into using awk with an if statement, but I am not sure where to start. Any pointers in the right direction would be greatly appreciated.

1 answer:

Answer 0: (score: 1)

If you are not concerned about the spacing of the header line, try the following.

awk '
FNR==1{
  print
  next
}
{
  for(i=3;i<=NF;i++){
    k=$1 OFS i
    a[k]=(k in a ? a[k] FS $i : $i)
  }
  b[$1]=$2
}
END{
  for(j in b){
    printf "%s%s%s%s", j, OFS, b[j], OFS
    for(i=3;i<=NF;i++){
       printf "%s%s", a[j OFS i], (i==NF ? ORS : OFS)
    }
  }
}' OFS="\t"  Input_file

Explanation: a line-by-line explanation of the command above follows.

awk '                                                      ##Start the awk program.
FNR==1{                                                    ##If this is the first line of Input_file (the header), do the following.
  print                                                    ##Print the header line unchanged.
  next                                                     ##Skip the remaining rules for this line and move on to the next line.
}                                                          ##Close the FNR==1 block.
{                                                          ##This block runs for every line except the header.
  for(i=3;i<=NF;i++){                                      ##Loop over the fields from the 3rd up to the last one (NF).
    k=$1 OFS i                                             ##Build the array key from the gene ($1) and the column number i.
    a[k]=(k in a ? a[k] FS $i : $i)                        ##Append field i to the value already stored under that key, separated by FS (a space); the first value is stored as is.
  }                                                        ##Close the for loop.
  b[$1]=$2                                                 ##Store $2 in array b under the key $1.
}                                                          ##Close the main block.
END{                                                       ##The END block runs after all input has been read.
  for(j in b){                                             ##Loop over every key (gene) in array b; awk does not guarantee the order.
    printf "%s%s%s%s", j, OFS, b[j], OFS                   ##Print the gene, OFS, its match_type and another OFS.
    for(i=3;i<=NF;i++){                                    ##Loop over columns 3 to NF (in END, NF still holds the field count of the last line read).
       printf "%s%s", a[j OFS i], (i==NF ? ORS : OFS)      ##Print the merged values of column i, followed by OFS, or by ORS (a newline) after the last column.
    }                                                      ##Close the inner for loop.
  }                                                        ##Close the outer for loop.
}' OFS="\t"  Input_file                                    ##Set OFS to a TAB and name the input file.