Processing a text file in awk and creating a new file

Time: 2018-02-17 14:31:04

Tags: awk

I have a text file like this small example:

chr10:102721669-102724893   3217    3218    5
chr10:102721669-102724893   3218    3219    1
chr10:102721669-102724893   3219    3220    5
chr10:102721669-102724893   421 422 1
chr10:102721669-102724893   858 859 2
chr10:102539319-102568941   13921   13922   1
chr10:102587299-102589074   1560    1561    1
chr10:102587299-102589074   1565    1566    1
chr10:102587299-102589074   1595    1596    1
chr10:102587299-102589074   944 945 1

The expected output is:

chr10:102721669-102724893   3217    3218    5   CA
chr10:102721669-102724893   3218    3219    1   CA
chr10:102721669-102724893   3219    3220    5   CA
chr10:102721669-102724893   421 422 1   BA
chr10:102721669-102724893   858 859 2   BA
chr10:102539319-102568941   13921   13922   1   NON
chr10:102587299-102589074   1560    1561    1   CA  
chr10:102587299-102589074   1565    1566    1   CA
chr10:102587299-102589074   1595    1596    1   CA
chr10:102587299-102589074   944 945 1   BA

The input has 4 tab-separated columns. The output has one extra column holding one of 3 classes (CA, NON or BA):

1- If the value in the 1st column is not repeated anywhere in the input, the line is classified as NON in the 5th column of the output.

2- If (the number just after ":" in the 1st column + the 2nd column) - the number just after "-" in the 1st column is smaller than -30 (meaning -31 or smaller), the line is classified as BA. For example, for the last line: (102587299 + 944) - 102589074 = -831, so this line is classified as BA.

3- If (the number just after ":" in the 1st column + the 2nd column) - the number just after "-" in the 1st column is equal to or bigger than -30 (meaning -30, -29 and so on), the line is classified as CA. For example, for the first line:

(102721669 + 3217) - 102724893 = -7
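The two worked examples can be checked with a few lines of awk. This is only a sketch of the arithmetic in rules 2 and 3: `classify` and its parameter names are hypothetical helpers introduced for illustration, and rule 1 (NON) is not covered here.

```shell
result=$(awk '
  # classify() is a hypothetical helper: BA below -30, CA otherwise
  function classify(start, pos, stop) { return (start + pos - stop < -30) ? "BA" : "CA" }
  BEGIN {
    print (102721669 + 3217) - 102724893, classify(102721669, 3217, 102724893)  # first line
    print (102587299 + 944) - 102589074, classify(102587299, 944, 102589074)    # last line
  }')
printf '%s\n' "$result"
# prints:
# -7 CA
# -831 BA
```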

I would like to do this in awk. This is what I tried:

awk -F "\t"":""-" '{if($2+$4-$3 < -30) ; print $7 = BA,  if($2+$4-$3 >= -30) ; print $7 = CA}' file.txt > out.txt

But it does not return what I expect. Do you know how to fix it?

1 answer:

Answer 0: (score: 2)

Try

$ awk 'BEGIN{FS=OFS="\t"} NR==FNR{a[$1]++; next}
       { split($1, b, /[\t:-]/);
         $5 = a[$1]==1 ? "NON" : (b[2]+$2-b[3]) < -30 ? "BA" : "CA" }
       1' file.txt file.txt
chr10:102721669-102724893   3217    3218    5   CA
chr10:102721669-102724893   3218    3219    1   CA
chr10:102721669-102724893   3219    3220    5   CA
chr10:102721669-102724893   421 422 1   BA
chr10:102721669-102724893   858 859 2   BA
chr10:102539319-102568941   13921   13922   1   NON
chr10:102587299-102589074   1560    1561    1   BA
chr10:102587299-102589074   1565    1566    1   BA
chr10:102587299-102589074   1595    1596    1   BA
chr10:102587299-102589074   944 945 1   BA
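  • BEGIN{FS=OFS="\t"} sets both the input and the output field separator to a tab
  • NR==FNR{a[$1]++; next} counts how many times each value of the first field occurs in the file. The input file is passed twice, so on the second pass the decision can be made based on that count
  • split($1, b, /[\t:-]/) splits the first column further and saves the pieces in the array b, so b[2] is the number after ":" and b[3] is the number after "-"
  • The rest of the code assigns the 5th field according to the given conditions and prints the modified line
  • The rows for chr10:102587299-102589074 at 1560, 1565 and 1595 come out as BA rather than the CA shown in the question, because under the stated rules, for example, (102587299 + 1560) - 102589074 = -215, which is smaller than -30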
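As a side note on the attempt in the question: the shell joins `"\t"":""-"` into the single string `\t:-`. Because that is longer than one character, awk treats it as a regular expression matching a tab, a colon and a dash in that exact order, so the sample lines are never split. The `if (...) ;` form also ends each `if` with an empty statement, and an unquoted BA is an uninitialized variable rather than a string. A bracket expression such as `[\t:-]` matches any one of the three characters. The sketch below applies only rules 2 and 3 that way. It assumes the chromosome name itself contains no ":" or "-", and `sample.txt` is a hypothetical file name.

```shell
# sample.txt is a hypothetical file holding two of the sample rows, tab separated
printf 'chr10:102721669-102724893\t3217\t3218\t5\n'  > sample.txt
printf 'chr10:102721669-102724893\t421\t422\t1\n'   >> sample.txt

# After splitting on tab, ":" or "-": $2 = start, $3 = end, $4 = the original 2nd column
awk -F'[\t:-]' '{ print $0 "\t" (($2 + $4 - $3 < -30) ? "BA" : "CA") }' sample.txt > out.txt
cat out.txt
```

Appending to `$0` instead of assigning a new field leaves the original line untouched, so the tabs and the ":" and "-" inside the first column are preserved. The NON class still needs the per-key counts from the two-pass version above.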

