I'm trying to learn how to concatenate multiple columns in Linux. I have a dataset that looks like this:
gene match_type drug sources pmids
ABO Definite CHEMBL50267 DrugBank 17139284|17016423
ABO Definite URIDINE_DIPHOSPHATE TdgClinicalTrial 17139284|17016423
ABO Definite CHEMBL439009 DrugBank 12972418
ABO Definite CHEMBL1232343 DrugBank NA
ABO Definite CHEMBL503075 DrugBank NA
I'm trying to collapse it into a single row (concatenating the drug, sources, and pmids columns), like this:
gene match_type drug sources pmids
ABO Definite CHEMBL1232343 CHEMBL439009 CHEMBL50267 CHEMBL503075 URIDINE_DIPHOSPHATE NA DrugBank TdgClinicalTrial DrugBank DrugBank DrugBank 0 12972418 17139284|17016423 17139284|17016423 NA NA
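I've looked into using awk with if statements, but I'm not sure where to start. Any help pointing me in the right direction would be much appreciated.
Answer 0 (score: 1)
If you are not worried about the spacing in the header, try the following.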
awk '
FNR==1{
print
next
}
{
for(i=3;i<=NF;i++){
a[$1 OFS i]=(a[$1 OFS i]?a[$1 OFS i] FS $i:$i)
}
b[$1]=$2
}
END{
for(j in b){
printf "%s", j OFS b[j] OFS
for(i=3;i<=NF;i++){
printf("%s%s",a[j OFS i],i==NF?ORS:OFS)
}
}
}' OFS="\t" Input_file
Explanation: a detailed explanation of the above command follows.
awk ' ##Start of the awk program.
FNR==1{ ##If this is the first line of Input_file (the header), do the following.
print ##Print the header line unchanged.
next ##Skip the remaining statements for this line and move on to the next line.
} ##End of the FNR==1 block.
{ ##Main block, executed for every line except the header.
for(i=3;i<=NF;i++){ ##Loop over fields 3 through NF.
a[$1 OFS i]=(a[$1 OFS i]?a[$1 OFS i] FS $i:$i) ##Array a is keyed by $1 and the field number i; append $i to the value already stored there, separated by FS.
} ##End of the for loop.
b[$1]=$2 ##Array b maps $1 to $2.
} ##End of the main block.
END{ ##END block, executed after all input has been read.
for(j in b){ ##Iterate over the keys of array b (iteration order is not guaranteed).
printf "%s", j OFS b[j] OFS ##Print the key j, OFS, the value of b[j], and another OFS.
for(i=3;i<=NF;i++){ ##Loop over fields 3 through NF (NF still holds the field count of the last line read).
printf("%s%s",a[j OFS i],i==NF?ORS:OFS) ##Print the values collected for column i, followed by ORS after the last column and OFS otherwise.
} ##End of the inner for loop.
} ##End of the outer for loop.
}' OFS="\t" Input_file ##Set OFS to TAB and name the input file.