awk to print text in a field based on a coordinate range and an exact match

Date: 2018-02-16 01:25:30

Tags: awk

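In the awk below, I am trying to match the value in $4 of file1 with the part of $4 in file2 that comes before the first _. I store the $4 value from file1 in a. I then store the value in $2 as min, the value in $3 as max, and the value in $1 as chr. If the key stored in a equals array[1], I use the values stored in min, max, and chr to check whether $2 and $3 of the file2 line overlap the range from file1. If they do, overlap is printed; if not, missing is printed. I am trying to make sure that the lines match and that the coordinates in file2 are covered by file1. My actual data is several thousand lines in the format below, and every line in file2 will produce a match. I commented the awk in the hope that it helps, because I am getting a syntax error. There may well be a better way, but I wanted to give it a try.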

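If I remove {split($4,array,"_")} and remove array[1], I get the current output shown below, but not all lines are printed, only the overlap line, and I am not sure that the exact match is being checked.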
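As a small illustration of the split step discussed above (a sketch only, not part of the original script), the following takes one $4 value from file2 and prints the piece before the first _. The shell variable names `name` and `gene` are introduced here purely for the example.

```shell
# One $4 value copied from file2 in the question.
name='RPS19_cds_4_0_chr19_42373769_f'

# split() cuts the string on "_" and fills array[1], array[2], ...;
# array[1] is the gene name that is compared against $4 of file1.
gene=$(printf '%s\n' "$name" | awk '{ split($0, array, "_"); print array[1] }')
echo "$gene"
```

Without the split, the whole string RPS19_cds_4_0_chr19_42373769_f would be looked up in a, and it would never equal the plain gene name RPS19 stored from file1.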

file1 tab-delimited

chr19   42373737    42373856    RPS19
chr6    32790021    32790140    TAP2

file2 tab-delimited

chr19   42364844    42364915    RPS19_cds_1_0_chr19_42364845_f  0   +
chr19   42365180    42365281    RPS19_cds_2_0_chr19_42365181_f  0   +
chr19   42373100    42373284    RPS19_cds_3_0_chr19_42373101_f  0   +
chr19   42373768    42373823    RPS19_cds_4_0_chr19_42373769_f  0   +
chr19   42375418    42375445    RPS19_cds_5_0_chr19_42375419_f  0   +

desired output tab-delimited

chr19   42364844    42364915    RPS19_cds_1_0_chr19_42364845_f  0   +     missing
chr19   42365180    42365281    RPS19_cds_2_0_chr19_42365181_f  0   +     missing 
chr19   42373100    42373284    RPS19_cds_3_0_chr19_42373101_f  0   +     missing
chr19   42373768    42373823    RPS19_cds_4_0_chr19_42373769_f  0   +     overlap
chr19   42375418    42375445    RPS19_cds_5_0_chr19_42375419_f  0   +     missing

AWK

awk ' # call awk script
 BEGIN { FS=OFS="\t" }  # define FS and OFS as tab
  FNR==NR{  # start processing same line in files
   a[$4];  # store gene in a
   min[$4]=$2;  # store starting coordinate
   max[$4]=$3;  # store ending coordinate
    next         # process next line
}  # close block
 {  # start block
   split($4,array,"_");   # split $4 on _ and store in array[1]
   print $0,(array[1] in a) && ($2>=min[array[1]] && 
$2<=max[array[1]])?"overlap":"missing" # print all lines followed by 
overlap or missing depending on condition (if array[1] = a and $2 in 
file2 is greater than or equal to min and $3 in file2 greater than or 
equal to max print overlap, else missing)
}  # close block
' file1 file2  # define input

current output

1   42373768    42373823    RPS19_cds_4_0_chr19_42373769_f  0   +   overlap 

1 Answer:

Answer 0 (score: 4)

Here you go, awk to the rescue. I also could not tell whether your input files are actually TAB-delimited; if they are, then also set FS="\t" before Input_file1 in this code.

awk 'FNR==NR{a[$4];min[$4]=$2;max[$4]=$3;next} {split($4,array,"_");print $0,(array[1] in a) && ($2>=min[array[1]] && $2<=max[array[1]])?"overlap":"missing"}'  Input_file1  OFS="\t"   Input_file2
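For tab-delimited input, the FS="\t" remark above might look like the following in practice. This is only a sketch: it recreates two file2 rows and one file1 row from the question with real tab characters, and it assumes that giving both FS and OFS as command-line assignments before the first file name is what was intended. The output file name result.tsv is made up for the example.

```shell
# Sample rows copied from the question, written with real tab characters.
printf 'chr19\t42373737\t42373856\tRPS19\n' > Input_file1
printf 'chr19\t42373100\t42373284\tRPS19_cds_3_0_chr19_42373101_f\t0\t+\n' > Input_file2
printf 'chr19\t42373768\t42373823\tRPS19_cds_4_0_chr19_42373769_f\t0\t+\n' >> Input_file2

# FS and OFS are assigned before Input_file1, so both files are read
# and written with tab as the separator.
awk 'FNR==NR{a[$4];min[$4]=$2;max[$4]=$3;next}
     {split($4,array,"_");
      print $0,((array[1] in a) && $2>=min[array[1]] && $2<=max[array[1]])?"overlap":"missing"}' \
    FS="\t" OFS="\t" Input_file1 Input_file2 > result.tsv
cat result.tsv
```

An assignment placed on the command line takes effect when awk reaches it in the argument list, which is why its position relative to the file names matters.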

Adding a non-one-liner form of the solution as well:

awk '
FNR==NR{
  a[$4];
  min[$4]=$2;
  max[$4]=$3;
  next
}
{
  split($4,array,"_");
  print $0,(array[1] in a) && ($2>=min[array[1]] && $2<=max[array[1]])?"overlap":"missing"
}
'  Input_file1  OFS="\t"  Input_file2

The output will be as follows:

chr19   42364844    42364915    RPS19_cds_1_0_chr19_42364845_f  0   +   missing
chr19   42365180    42365281    RPS19_cds_2_0_chr19_42365181_f  0   +   missing
chr19   42373100    42373284    RPS19_cds_3_0_chr19_42373101_f  0   +   missing
chr19   42373768    42373823    RPS19_cds_4_0_chr19_42373769_f  0   +   overlap
chr19   42375418    42375445    RPS19_cds_5_0_chr19_42375419_f  0   +   missing
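
Note that the command above only tests whether the start coordinate ($2) of each file2 line falls inside the file1 range. If the intent is that any part of the $2 to $3 interval touching the file1 range should count, one possible variant compares both ends and also checks the chromosome. This is a sketch under assumptions that are not stated in the question: one file1 row per gene, closed intervals, and the chromosome in $1 must agree. The `chr` array, the `part`, `g`, and `hit` names, the output file name, and the third file2 row are all invented here for illustration.

```shell
# file1 rows copied from the question.
printf 'chr19\t42373737\t42373856\tRPS19\n' > file1
printf 'chr6\t32790021\t32790140\tTAP2\n' >> file1

# Two file2 rows copied from the question ...
printf 'chr19\t42373100\t42373284\tRPS19_cds_3_0_chr19_42373101_f\t0\t+\n' > file2
printf 'chr19\t42373768\t42373823\tRPS19_cds_4_0_chr19_42373769_f\t0\t+\n' >> file2
# ... plus a hypothetical row (not in the question) that starts before
# the file1 range but ends inside it.
printf 'chr19\t42373700\t42373750\tRPS19_hypothetical_row\t0\t+\n' >> file2

awk 'BEGIN { FS = OFS = "\t" }
FNR == NR {                  # first file: remember chromosome and range per gene
    chr[$4] = $1; min[$4] = $2; max[$4] = $3
    next
}
{
    split($4, part, "_")     # part[1] is the gene name before the first "_"
    g = part[1]
    # two closed intervals overlap when each one starts before the other ends
    hit = (g in min) && ($1 == chr[g]) && ($2 <= max[g]) && ($3 >= min[g])
    print $0, (hit ? "overlap" : "missing")
}' file1 file2 > overlap_result.tsv
cat overlap_result.tsv
```

On the rows taken from the question this gives the same labels as before. The hypothetical row is the case where the two checks differ: its start lies before the file1 range, so a start-only test would label it missing, while the two-sided test labels it overlap.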