Question

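I am trying to compare two large tab-delimited files. I have been trying awk & bash (Ubuntu 15.10), Python (v3.5), and PowerShell (Windows 10). My only background is Java, but my field tends to stick with scripting languages.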

This is what I would like to see:

File 1, A[ ]

1   gramene gene    4854    9652    .   -   .   ID=gene:GRMZM2G059865;biotype=protein_coding;description=Uncharacterized protein  [Source:UniProtKB/TrEMBL%3BAcc:C0P8I2];gene_id=GRMZM2G059865;logic_name=genebuilder;version=1
1   gramene gene    9882    10387   .   -   .   ID=gene:GRMZM5G888250;biotype=protein_coding;gene_id=GRMZM5G888250;logic_name=genebuilder;version=1
1   gramene gene    109519  111769  .   -   .   ID=gene:GRMZM2G093344;biotype=protein_coding;gene_id=GRMZM2G093344;logic_name=genebuilder;version=1
1   gramene gene    136307  138929  .   +   .   ID=gene:GRMZM2G093399;biotype=protein_coding;gene_id=GRMZM2G093399;logic_name=genebuilder;version=1

File 2, B[ ]

S1_6370 T/C 1   6370    +
S1_8210 T   1   8210    +
S1_8376 A   1   8376    +
S1_9889 A   1   9889    +

Output

1   ID=gene:GRMZM2G059865   4854    9652    -   S1_6370 T/C 6370    +
1   ID=gene:GRMZM2G059865   4854    9652    -   S1_8210 T   8210    +
1   ID=gene:GRMZM2G059865   4854    9652    -   S1_8376 A   8376    +
1   ID=gene:GRMZM5G888250   9882    10387   -   S1_9889 A   9889    +

My general logic

loop (until end of A[ ] and B[ ])
if
B[$4]>A[$4] && B[$4]<A[$5]  #if the value in B column 4 is in between the values in A columns 4 & 5.
then
-F"\t" print {A[1], A[9(filtered)], A[$4FS$5], B[$1], B[$2], B[$3], B[$4], B[$5]}   #hopefully reflects awk column calls if the two files were able to have their columns defined that way.
movea++ # to see if the next set of B column 4 values is in between the values in A columns 4 & 5 
else
moveb++ #to see if the next set of A columns 4&5 values contain the current values of B column 4 in them.

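I know this logic does not follow any language I know, although it partly resembles several of them. NR and FNR appear to be two running counters built into awk. Awk made it easy to split file 2 into 10 files, one for each of the 10 values in B[$1], and to cut away the few hundred columns (~255+) beyond the 5 you see here. I am now working with pieces of file 2 that are a few MB each instead of one 1.6 GB file. Besides reducing load time, I wanted to simplify the loop. I have not gone back to my earlier Python or PowerShell attempts since reducing the file size. I had convinced myself that they were simply not going to read my file with their built-in libraries or cmdlets. If I cannot find an awk solution, I will try them again soon.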

comparing multiple files and columns using awk  # referenced
Awk greater than less than but within a set range  # referenced
efficiently splitting one file into several files by value of column  # this one worked
Using awk to get a specific string in line  # can filter column 9
How to check value of a column lies between values of two columns in other file and print corresponding value from column in Unix?  # this seems the closest, but it does not print into a third file as I want, and I still cannot fully figure out the syntax

Answer 1

Try:

$ awk 'BEGIN{x=getline s <"B"; split(s,b,"\t")} !x{exit} {sub(/;.*/,"",$9); while (x && $4<b[4] && b[4]<$5){print $1,$9,$4,$5,$7,b[1],b[2],b[4],b[5]; x=getline s <"B"; split(s,b,"\t")}}' OFS='\t' A
1       ID=gene:GRMZM2G059865   4854    9652    -       S1_6370 T/C     6370    +
1       ID=gene:GRMZM2G059865   4854    9652    -       S1_8210 T       8210    +
1       ID=gene:GRMZM2G059865   4854    9652    -       S1_8376 A       8376    +
1       ID=gene:GRMZM5G888250   9882    10387   -       S1_9889 A       9889    +

How it works

The program loops implicitly over the lines of file A.

BEGIN{x=getline s <"B"; split(s,b,"\t")}

Before we start reading file A, read the first line of file B into the string s. Split that string into the array b, using a tab as the separator.
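To see the getline and split steps on their own, here is a minimal sketch. The two-line sample file and the variable names (`file`, `getline_demo`) are assumptions made for illustration; only the `getline s < file` and `split(s, b, "\t")` calls come from the answer.

```shell
#!/bin/sh
# Hypothetical two-line sample of file B (tab-separated), not the real data.
tmpdir=$(mktemp -d)
printf 'S1_6370\tT/C\t1\t6370\t+\nS1_8210\tT\t1\t8210\t+\n' > "$tmpdir/B"

getline_demo=$(awk -v file="$tmpdir/B" 'BEGIN {
    # getline returns 1 for each line it reads and 0 at end of file.
    while ((x = (getline s < file)) > 0) {
        n = split(s, b, "\t")            # b[1]..b[n] now hold the fields
        print "x=" x, "fields=" n, "pos=" b[4]
    }
    print "x=" x                          # 0 here: end of file reached
    # A file that cannot be opened gives -1, which awk also treats as true.
    m = (getline s < (file ".missing"))
    print "missing=" m
}')
printf '%s\n' "$getline_demo"
rm -rf "$tmpdir"
```

Because -1 counts as true, testing the return value with `> 0` is a slightly safer way to detect "a line was read" than testing it for truth alone.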

getline returns 1 (true) each time it reads a line, so x stays true until file B runs out of lines, at which point getline returns 0.

!x{exit}

If we have run out of lines in file B, exit the program.

sub(/;.*/,"",$9)

Remove the first ; and everything after it from field 9 of file A.

while (x && $4<b[4] && b[4]<$5){print $1,$9,$4,$5,$7,b[1],b[2],b[4],b[5]; x=getline s <"B"; split(s,b,"\t")}

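Loop over the lines of file B: as long as the fourth field of the current B line lies between fields 4 and 5 of the current line of file A, print the requested output and read the next line of B.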

As before, getline keeps x true until file B runs out of lines.

OFS='\t'

Set the output field separator to a tab.
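The field trimming and the tab-separated output can be tried in isolation with a one-line sketch like the following. It assumes the input is split on tabs with `-F'\t'` (the answer itself relies on awk's default whitespace splitting), and it uses `-v OFS='\t'` instead of the trailing `OFS='\t'` operand; both set the same variable. The sample line is a shortened copy of one gene line from the question.

```shell
#!/bin/sh
# One gene line from file A, with the attribute column shortened for the demo.
line=$(printf '1\tgramene\tgene\t9882\t10387\t.\t-\t.\tID=gene:GRMZM5G888250;biotype=protein_coding;gene_id=GRMZM5G888250')

trimmed=$(printf '%s\n' "$line" | awk -F'\t' -v OFS='\t' '{
    sub(/;.*/, "", $9)          # drop the first ";" and everything after it
    print $1, $9, $4, $5, $7    # each comma becomes OFS (a tab) in the output
}')
printf '%s\n' "$trimmed"
```

Splitting on tabs explicitly may be the more robust choice when the attribute column can contain spaces, as the description text in the first gene line does.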

Multi-line version

For those who prefer their awk code spread over multiple lines:

awk '

BEGIN{
    x=getline s <"B"
    split(s,b,"\t")
} 

!x {
    exit
} 

{   
    sub(/;.*/,"",$9)
    while (x && $4<b[4] && b[4]<$5) {
        print $1,$9,$4,$5,$7,b[1],b[2],b[4],b[5]
        x=getline s <"B"; split(s,b,"\t")
    }
}
' OFS='\t' A
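One thing to watch for: the loop above appears to advance through file B only when a position matches the current gene, so a position that falls between two genes would stop any later matches from being found. The variant below is a sketch of one way around that, not part of the original answer. It assumes both files are sorted by position, cover a single chromosome, and contain no overlapping genes. The helper `next_b`, the variables `more` and `bfile`, and the intergenic line `S1_9700` are all made up for the demonstration.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
# Shortened excerpts of files A and B; S1_9700 is an invented intergenic position.
printf '1\tgramene\tgene\t4854\t9652\t.\t-\t.\tID=gene:GRMZM2G059865;biotype=protein_coding\n1\tgramene\tgene\t9882\t10387\t.\t-\t.\tID=gene:GRMZM5G888250;biotype=protein_coding\n' > "$tmpdir/A"
printf 'S1_6370\tT/C\t1\t6370\t+\nS1_9700\tG\t1\t9700\t+\nS1_9889\tA\t1\t9889\t+\n' > "$tmpdir/B"

merged=$(awk -F'\t' -v OFS='\t' -v bfile="$tmpdir/B" '
function next_b() {             # read the next line of B into array b
    more = ((getline s < bfile) > 0)
    if (more) split(s, b, "\t")
}
BEGIN { next_b() }
!more { exit }                  # nothing left in B: stop reading A
{
    sub(/;.*/, "", $9)
    # Skip B positions at or before the gene start; with sorted input
    # they cannot fall inside this gene or any later one.
    while (more && b[4] + 0 <= $4 + 0) next_b()
    # Print every B position strictly inside the gene.
    while (more && b[4] + 0 < $5 + 0) {
        print $1, $9, $4, $5, $7, b[1], b[2], b[4], b[5]
        next_b()
    }
}' "$tmpdir/A")
printf '%s\n' "$merged"
rm -rf "$tmpdir"
```

Each file is still read only once, so this keeps the single-pass behaviour of the original while tolerating positions that lie outside every gene.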

Answer 2

Another awk-based solution:

$ awk -F'\t' 'NR==FNR{
         b0[NR]=$0;
         b4[NR]=$4;
         b_count=NR;
         next;
       }
       {
           for(i=1;i<=b_count;i++)
              if((b4[i]>$4) && (b4[i]<$5)){
                  print $1, gensub(/;.*/,"",1,$9), $4, $5, b0[i]
              }
       }' OFS=$'\t' file_b file_a

Output:

1   ID=gene:GRMZM2G059865   4854    9652    S1_6370 T/C 1   6370    +
1   ID=gene:GRMZM2G059865   4854    9652    S1_8210 T   1   8210    +
1   ID=gene:GRMZM2G059865   4854    9652    S1_8376 A   1   8376    +
1   ID=gene:GRMZM5G888250   9882    10387   S1_9889 A   1   9889    +

Explanation:

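NR==FNR: true only while reading the first file, file_b.
Store the whole file in arrays: b0 (the full line) and b4 (column 4).
next: skip the code that processes the second file.
For the second file, file_a, compare and print the matching lines in the required format.
gensub: a regex substitution function (a gawk extension), used here to trim the 9th field of file_a. Alternatives such as the split function would also work.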
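Since gensub is specific to gawk, here is a sketch of the split alternative mentioned above, which should also run under other awk implementations such as mawk. The one-line sample files and the array name `attr` are assumptions for illustration; the rest follows the structure of the solution.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
# One-line excerpts of the two inputs, with the attribute column shortened.
printf 'S1_9889\tA\t1\t9889\t+\n' > "$tmpdir/file_b"
printf '1\tgramene\tgene\t9882\t10387\t.\t-\t.\tID=gene:GRMZM5G888250;biotype=protein_coding\n' > "$tmpdir/file_a"

joined=$(awk -F'\t' -v OFS='\t' '
NR == FNR {                     # first file only: remember each line of file_b
    b0[NR] = $0
    b4[NR] = $4 + 0             # force a numeric value for the comparisons
    b_count = NR
    next
}
{
    split($9, attr, ";")        # attr[1] is the text before the first ";"
    for (i = 1; i <= b_count; i++)
        if (b4[i] > $4 && b4[i] < $5)
            print $1, attr[1], $4, $5, b0[i]
}' "$tmpdir/file_b" "$tmpdir/file_a")
printf '%s\n' "$joined"
rm -rf "$tmpdir"
```

Unlike sub, split leaves $9 itself unchanged, which is the same property gensub has in the original.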

awk compare 3 values, 2nd file value between 1st file values, with multiple column printout from both files to a 3rd file
