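In the awk below, I am trying to match the value in $4 of file1 with the part of $4 that comes before the first _ in file2. I store the $4 value of file1 in A. Then I store the value in $2 as min, the value in $3 as max, and the value in $1 as chr.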
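If the value in A is equal to array[1], I then use the values stored in min, max and chr to check whether the $1, $2 and $3 values in file2 overlap them. If they do, overlap is printed; if not, missing is printed. I am trying to make sure that the lines match and that the coordinates from file1 cover those in file2. My actual data is thousands of lines in the format below, and every line in file2 will produce a match. I have commented the awk in the hope that it helps, since I am getting a syntax error. There may be a better way, but I wanted to give it a try.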
If I remove {split($4,array,"_")} and array[1], I get the current output, but not every line is printed, only the overlap line, and I am not sure an exact match would be printed.
file1 tab-delimited
chr19 42373737 42373856 RPS19
chr6 32790021 32790140 TAP2
file2 tab-delimited
chr19 42364844 42364915 RPS19_cds_1_0_chr19_42364845_f 0 +
chr19 42365180 42365281 RPS19_cds_2_0_chr19_42365181_f 0 +
chr19 42373100 42373284 RPS19_cds_3_0_chr19_42373101_f 0 +
chr19 42373768 42373823 RPS19_cds_4_0_chr19_42373769_f 0 +
chr19 42375418 42375445 RPS19_cds_5_0_chr19_42375419_f 0 +
Desired output tab-delimited
chr19 42364844 42364915 RPS19_cds_1_0_chr19_42364845_f 0 + missing
chr19 42365180 42365281 RPS19_cds_2_0_chr19_42365181_f 0 + missing
chr19 42373100 42373284 RPS19_cds_3_0_chr19_42373101_f 0 + missing
chr19 42373768 42373823 RPS19_cds_4_0_chr19_42373769_f 0 + overlap
chr19 42375418 42375445 RPS19_cds_5_0_chr19_42375419_f 0 + missing
AWK
awk ' # call awk script
BEGIN { FS=OFS="\t" } # define FS and OFS as tab
FNR==NR{ # true only while reading the first file (file1)
a[$4]; # store gene in a
min[$4]=$2; # store starting coordinate
max[$4]=$3; # store ending coordinate
next # process next line
} # close block
{ # start block
split($4,array,"_"); # split $4 on _ and store in array[1]
print $0,(array[1] in a) && ($2>=min[array[1]] &&
$2<=max[array[1]])?"overlap":"missing" # print all lines followed by
overlap or missing depending on condition (if array[1] = a and $2 in
file2 is greater than or equal to min and $3 in file2 greater than or
equal to max print overlap, else missing)
} # close block
' file1 file2 # define input
Current output
1 42373768 42373823 RPS19_cds_4_0_chr19_42373769_f 0 + overlap
Answer 0 (score: 4)
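Here you go, superstar awk to the rescue:
I also could not tell whether your Input_files are actually TAB-delimited; if they are, put FS="\t" before Input_file1 in this code as well.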
awk 'FNR==NR{a[$4];min[$4]=$2;max[$4]=$3;next} {split($4,array,"_");print $0,(array[1] in a) && ($2>=min[array[1]] && $2<=max[array[1]])?"overlap":"missing"}' Input_file1 OFS="\t" Input_file2
Adding a non-one-liner form of the solution too:
awk '
FNR==NR{
a[$4];
min[$4]=$2;
max[$4]=$3;
next
}
{
split($4,array,"_");
print $0,(array[1] in a) && ($2>=min[array[1]] && $2<=max[array[1]])?"overlap":"missing"
}
' Input_file1 OFS="\t" Input_file2
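For reference, here is a commented sketch of the same script. The syntax error in the question most likely comes from the long comment that wraps over several lines: in awk a comment runs only from # to the end of that line, so the unprefixed continuation lines are parsed as code. The temporary directory and the file names under it are assumptions for illustration only; the data is the sample from the question, and FS is set to tab on the assumption that the files really are tab-delimited.

```shell
# Sample data from the question, written to temporary files (illustrative paths).
tmp=$(mktemp -d)
printf '%s\t%s\t%s\t%s\n' \
  chr19 42373737 42373856 RPS19 \
  chr6 32790021 32790140 TAP2 > "$tmp/file1"
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr19 42364844 42364915 RPS19_cds_1_0_chr19_42364845_f 0 + \
  chr19 42365180 42365281 RPS19_cds_2_0_chr19_42365181_f 0 + \
  chr19 42373100 42373284 RPS19_cds_3_0_chr19_42373101_f 0 + \
  chr19 42373768 42373823 RPS19_cds_4_0_chr19_42373769_f 0 + \
  chr19 42375418 42375445 RPS19_cds_5_0_chr19_42375419_f 0 + > "$tmp/file2"

commented=$(awk '
BEGIN { FS = OFS = "\t" }      # tab-separated input and output
FNR == NR {                    # true only while reading the first file
    a[$4]                      # remember the gene name
    min[$4] = $2               # start coordinate for that gene
    max[$4] = $3               # end coordinate for that gene
    next                       # skip the second block for file1 lines
}
{
    split($4, array, "_")      # array[1] is the text before the first "_"
    # Every comment line needs its own "#"; an unprefixed
    # continuation line would be parsed as awk code.
    print $0, ((array[1] in a) && $2 >= min[array[1]] && $2 <= max[array[1]]) ? "overlap" : "missing"
}
' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$commented"
```

The logic is unchanged from the script above; only the comments and spacing differ.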
Output will be as follows:
chr19 42364844 42364915 RPS19_cds_1_0_chr19_42364845_f 0 + missing
chr19 42365180 42365281 RPS19_cds_2_0_chr19_42365181_f 0 + missing
chr19 42373100 42373284 RPS19_cds_3_0_chr19_42373101_f 0 + missing
chr19 42373768 42373823 RPS19_cds_4_0_chr19_42373769_f 0 + overlap
chr19 42375418 42375445 RPS19_cds_5_0_chr19_42375419_f 0 + missing
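Note that the condition above only tests $2 of the second file against min and max. The question also mentions checking $3 and the chromosome, so a possible variant is sketched below. It assumes that "overlap" should mean the file1 interval fully covers the file2 interval on the same chromosome, which is one reading of the question and not something the original answer states. The names chr, part, g and covered, and the temporary file paths, are illustrative.

```shell
# Sample data from the question, written to temporary files (illustrative paths).
tmp=$(mktemp -d)
printf '%s\t%s\t%s\t%s\n' \
  chr19 42373737 42373856 RPS19 \
  chr6 32790021 32790140 TAP2 > "$tmp/file1"
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr19 42364844 42364915 RPS19_cds_1_0_chr19_42364845_f 0 + \
  chr19 42365180 42365281 RPS19_cds_2_0_chr19_42365181_f 0 + \
  chr19 42373100 42373284 RPS19_cds_3_0_chr19_42373101_f 0 + \
  chr19 42373768 42373823 RPS19_cds_4_0_chr19_42373769_f 0 + \
  chr19 42375418 42375445 RPS19_cds_5_0_chr19_42375419_f 0 + > "$tmp/file2"

checked=$(awk '
BEGIN { FS = OFS = "\t" }
FNR == NR {                    # first file: index each interval by gene name
    chr[$4] = $1
    min[$4] = $2 + 0           # "+ 0" forces a numeric value
    max[$4] = $3 + 0
    next
}
{
    split($4, part, "_")       # part[1] is the gene name, e.g. RPS19
    g = part[1]
    # Assumed meaning of overlap: same chromosome, and the file1
    # interval covers both the start and the end of this line.
    covered = (g in min) && ($1 == chr[g]) && ($2 + 0 >= min[g]) && ($3 + 0 <= max[g])
    print $0, (covered ? "overlap" : "missing")
}
' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$checked"
```

On the sample data this labels the same line as overlap as the answer's script does. If partial overlap should also count, the two coordinate tests could be replaced with `$2 + 0 <= max[g] && $3 + 0 >= min[g]`.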