Map names between two files and produce a single file

Time: 2014-09-09 07:23:13

Tags: bash perl awk

I have two files. File A looks like this:

>MA0003.1_TFAP2A
5.4052885343e-06    5.4052885343e-06    0.999983784134  5.4052885343e-06
5.4052885343e-06    0.999983784134  5.4052885343e-06    5.4052885343e-06
5.4052885343e-06    0.999983784134  5.4052885343e-06    5.4052885343e-06
0.118921753043  0.383780891224  0.248648677866  0.248648677866
0.10270588744   0.308106851744  0.329728005881  0.259459254935
0.0486530020973 0.421617910964  0.427023199498  0.10270588744
>MA0004.1_Arnt
0.200009998 0.799890021996  4.99900019996e-05   4.99900019996e-05
0.949860027994  4.99900019996e-05   0.0500399920016 4.99900019996e-05
4.99900019996e-05   4.99900019996e-05   4.99900019996e-05   0.999850029994
4.99900019996e-05   4.99900019996e-05   0.999850029994  4.99900019996e-05
>MA0006.1_Arnt::Ahr
0.125020829862  0.333319446759  0.0833611064823 0.458298616897
4.16597233794e-05   4.16597233794e-05   0.95821529745   0.0417013831028
4.16597233794e-05   0.95821529745   4.16597233794e-05   0.0417013831028
 >MA0006.1_Arntr
0.125020829862  0.333319446759  0.0833611064823 0.458298616897
4.16597233794e-05   4.16597233794e-05   0.95821529745   0.0417013831028
4.16597233794e-05   0.95821529745   4.16597233794e-05   0.0417013831028
 >MA0006.1_ArntAh
0.125020829862  0.333319446759  0.0833611064823 0.458298616897
4.16597233794e-05   4.16597233794e-05   0.95821529745   0.0417013831028
 4.16597233794e-05  0.95821529745   4.16597233794e-05   0.0417013831028
>MA0006.1_Arnt::A
0.125020829862  0.333319446759  0.0833611064823 0.458298616897
4.16597233794e-05   4.16597233794e-05   0.95821529745   0.0417013831028
4.16597233794e-05   0.95821529745   4.16597233794e-05   0.0417013831028

And file B looks like this (note that the names in file B also contain spaces; the last word on each line is the one that matters):

AP-2    TFAP2A
AXUD class 1    Arnt
AXU 2   Arnt::Ahr
AXU  Arntr
AXU ArntAh
AXU Arnt::A

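I want a third file that combines files A and B: the header name at the start of each record in file A should be replaced with the matching name from file B, like this: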

>AP-2
5.4052885343e-06    5.4052885343e-06    0.999983784134  5.4052885343e-06
5.4052885343e-06    0.999983784134  5.4052885343e-06    5.4052885343e-06
5.4052885343e-06    0.999983784134  5.4052885343e-06    5.4052885343e-06
0.118921753043  0.383780891224  0.248648677866  0.248648677866
0.10270588744   0.308106851744  0.329728005881  0.259459254935
0.0486530020973 0.421617910964  0.427023199498  0.10270588744
>AXUD class 1
0.200009998 0.799890021996  4.99900019996e-05   4.99900019996e-05
0.949860027994  4.99900019996e-05   0.0500399920016 4.99900019996e-05
4.99900019996e-05   4.99900019996e-05   4.99900019996e-05   0.999850029994
4.99900019996e-05   4.99900019996e-05   0.999850029994  4.99900019996e-05
>AXU 2
0.125020829862  0.333319446759  0.0833611064823 0.458298616897
4.16597233794e-05   4.16597233794e-05   0.95821529745   0.0417013831028
4.16597233794e-05   0.95821529745   4.16597233794e-05   0.0417013831028

What I have done so far is take file A and extract the second part of each header name, the part after the underscore (_), like this:

awk '/>/' <input_for_clustering.pwm | tr '_' '\t' | awk '{print $2}' > temp

Then I check whether these names exist in file B and extract the matching entries, like this:

for i in `cat temp`
do
    cat fileB | awk '{ if (($2=="'$i'")) {print $1 }}' >> data_res
done

Now the question is: how do I edit file A?

Kindly help.

I hope I have shown the effort and thought I have put into this.

3 answers:

Answer 0: (score: 2)

Try this:

awk 'NR==FNR{z=$NF;$NF="";a[z]=$0;next}
     /^>/{split($0,b,"_");if (b[2] in a){print ">"a[b[2]]}next}1' fileB fileA

Result:

>AP-2 
5.4052885343e-06    5.4052885343e-06    0.999983784134  5.4052885343e-06
5.4052885343e-06    0.999983784134  5.4052885343e-06    5.4052885343e-06
5.4052885343e-06    0.999983784134  5.4052885343e-06    5.4052885343e-06
0.118921753043  0.383780891224  0.248648677866  0.248648677866
0.10270588744   0.308106851744  0.329728005881  0.259459254935
0.0486530020973 0.421617910964  0.427023199498  0.10270588744
>AXUD class 1 
0.200009998 0.799890021996  4.99900019996e-05   4.99900019996e-05
0.949860027994  4.99900019996e-05   0.0500399920016 4.99900019996e-05
4.99900019996e-05   4.99900019996e-05   4.99900019996e-05   0.999850029994
4.99900019996e-05   4.99900019996e-05   0.999850029994  4.99900019996e-05
>AXU 2 
0.125020829862  0.333319446759  0.0833611064823 0.458298616897
4.16597233794e-05   4.16597233794e-05   0.95821529745   0.0417013831028
4.16597233794e-05   0.95821529745   4.16597233794e-05   0.0417013831028
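A note on the one-liner above: assigning $NF="" makes awk rebuild the record, which leaves a trailing space after the new name (visible in the result) and collapses runs of whitespace inside it, and the final next drops any header whose key is not found in fileB. The following is only a sketch of one way around both, not the answerer's code: it removes the key with sub() instead and prints unmatched headers unchanged. The temporary directory, the shortened sample data, the >MA0099.1_Unknown header and the fileC output name are assumptions made for illustration.

```shell
# Hypothetical, shortened copies of fileA and fileB from the question.
tmpdir=$(mktemp -d)
printf '%s\n' '>MA0003.1_TFAP2A'  '0.1 0.2 0.3 0.4' \
              '>MA0004.1_Arnt'    '0.5 0.6 0.7 0.8' \
              '>MA0099.1_Unknown' '0.9 0.1 0.0 0.0' > "$tmpdir/fileA"
printf '%s\n' 'AP-2    TFAP2A' 'AXUD class 1    Arnt' > "$tmpdir/fileB"

awk '
NR == FNR {                                  # first file (fileB): key -> new name
    key = $NF                                # the last word is the lookup key
    name = $0
    sub(/[ \t]+[^ \t]+[ \t]*$/, "", name)    # cut the key and the whitespace before it
    map[key] = name
    next
}
/^>/ {                                       # header line in fileA
    key = $0
    sub(/^>[^_]*_/, "", key)                 # keep what follows the first underscore
    if (key in map) $0 = ">" map[key]
}
{ print }                                    # data lines and unmatched headers pass through
' "$tmpdir/fileB" "$tmpdir/fileA" > "$tmpdir/fileC"

cat "$tmpdir/fileC"
```

With this sample, the third header has no entry in fileB and is passed through as is, while the other two are renamed without a trailing space.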

Answer 1: (score: 2)

I think this does what you want:

BEGIN { FS = "\t" }
NR==FNR { a[$2] = $1; next }
/^>/ { for (i in a) if ($0 ~ i "$") $0 = ">" a[i] }
{ print $0 }

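When the total record number (NR) equals the record number within the current file (FNR), i.e. while we are reading the first file, build the array a containing the replacements. next skips the rest of the script and moves on to the next line.

For lines in the second file that start with ">", loop through the keys of a, find the one that matches, and replace the line. I added the anchor $ so the pattern must match at the end of the line. { print $0 } prints the whole line (it can be abbreviated to 1).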

Testing it out:

$ awk -f swap.awk replace file
>AP-2
5.4052885343e-06    5.4052885343e-06    0.999983784134  5.4052885343e-06
5.4052885343e-06    0.999983784134  5.4052885343e-06    5.4052885343e-06
5.4052885343e-06    0.999983784134  5.4052885343e-06    5.4052885343e-06
0.118921753043  0.383780891224  0.248648677866  0.248648677866
0.10270588744   0.308106851744  0.329728005881  0.259459254935
0.0486530020973 0.421617910964  0.427023199498  0.10270588744
>AXUD class 1
0.200009998 0.799890021996  4.99900019996e-05   4.99900019996e-05
0.949860027994  4.99900019996e-05   0.0500399920016 4.99900019996e-05
4.99900019996e-05   4.99900019996e-05   4.99900019996e-05   0.999850029994
4.99900019996e-05   4.99900019996e-05   0.999850029994  4.99900019996e-05
>AXU 2
0.125020829862  0.333319446759  0.0833611064823 0.458298616897
4.16597233794e-05   4.16597233794e-05   0.95821529745   0.0417013831028
4.16597233794e-05   0.95821529745   4.16597233794e-05   0.0417013831028
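One thing to keep in mind with the script above: $0 ~ i "$" treats each key as a regular expression, so a key containing characters such as . or + would not be matched literally, and the loop keeps going after a replacement has been made. A possible variant, sketched here under the same assumption that the replace file is tab-separated, compares the end of the header with "_" plus the key as a plain string and stops at the first hit. The sample data and the temporary file names are made up for the example.

```shell
tmpdir=$(mktemp -d)
# Hypothetical, shortened inputs; the replace file is assumed to be tab-separated.
printf '%s\n' '>MA0004.1_Arnt'      '0.2 0.8 0.0 0.0' \
              '>MA0006.1_Arnt::Ahr' '0.1 0.3 0.1 0.5' > "$tmpdir/file"
printf 'AXUD class 1\tArnt\nAXU 2\tArnt::Ahr\n' > "$tmpdir/replace"

awk '
BEGIN { FS = "\t" }
NR == FNR { a[$2] = $1; next }           # first file: key -> replacement name
/^>/ {
    for (k in a) {
        suffix = "_" k                   # require the underscore right before the key
        start = length($0) - length(suffix) + 1
        if (start > 0 && substr($0, start) == suffix) {
            $0 = ">" a[k]                # plain string comparison, no regex involved
            break                        # stop after the first match
        }
    }
}
{ print }
' "$tmpdir/replace" "$tmpdir/file" > "$tmpdir/out"

cat "$tmpdir/out"
```

Including the underscore in the comparison also keeps a short key such as Arnt from matching a header that merely ends in the same letters.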

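Answer 2: (score: 1)

Here is a Perl solution. It looks a little cryptic because it relies on a couple of regular expressions.

The strategy is to process FileB first and build a hash that is then used to translate the header strings in FileA.

All output is sent to STDOUT.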

use strict;
use warnings;
use 5.010;
use autodie;

my %fb = do {
  open my ($fh), '<', 'FileB.txt';
  reverse map / ( \S+ (?: \s+ \S+ )* ) \s+ (\S+) /x, <$fh>;
};

open my ($fh), '<', 'FileA.txt';
while ( <$fh> ) {
   s/^>\K[^_]*_(\S+).*/$fb{$1}/;
   print;
}

Output:

>AP-2
5.4052885343e-06    5.4052885343e-06    0.999983784134  5.4052885343e-06
5.4052885343e-06    0.999983784134  5.4052885343e-06    5.4052885343e-06
5.4052885343e-06    0.999983784134  5.4052885343e-06    5.4052885343e-06
0.118921753043  0.383780891224  0.248648677866  0.248648677866
0.10270588744   0.308106851744  0.329728005881  0.259459254935
0.0486530020973 0.421617910964  0.427023199498  0.10270588744
>AXUD class 1
0.200009998 0.799890021996  4.99900019996e-05   4.99900019996e-05
0.949860027994  4.99900019996e-05   0.0500399920016 4.99900019996e-05
4.99900019996e-05   4.99900019996e-05   4.99900019996e-05   0.999850029994
4.99900019996e-05   4.99900019996e-05   0.999850029994  4.99900019996e-05
>AXU 2
0.125020829862  0.333319446759  0.0833611064823 0.458298616897
4.16597233794e-05   4.16597233794e-05   0.95821529745   0.0417013831028
4.16597233794e-05   0.95821529745   4.16597233794e-05   0.0417013831028
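If FileA contains a header whose suffix has no entry in FileB, $fb{$1} is undefined, so the substitution above would most likely leave a bare > on that line and, with warnings enabled, report an uninitialized value. Before running the replacement it can help to list such headers. Here is a minimal sketch in awk (rather than Perl, to match the other answers); the sample files, the >MA0099.1_Unknown header and the missing.txt name are assumptions for illustration.

```shell
tmpdir=$(mktemp -d)
# Hypothetical, shortened inputs with one header that FileB does not cover.
printf '%s\n' '>MA0003.1_TFAP2A'  '0.1 0.2 0.3 0.4' \
              '>MA0099.1_Unknown' '0.9 0.1 0.0 0.0' > "$tmpdir/FileA.txt"
printf '%s\n' 'AP-2    TFAP2A' 'AXUD class 1    Arnt' > "$tmpdir/FileB.txt"

awk '
NR == FNR { known[$NF] = 1; next }       # FileB: remember the last word of every line
/^>/ {
    key = $0
    sub(/^>[^_]*_/, "", key)             # text after the first underscore
    if (!(key in known)) print key       # header with no entry in FileB
}
' "$tmpdir/FileB.txt" "$tmpdir/FileA.txt" > "$tmpdir/missing.txt"

cat "$tmpdir/missing.txt"
```

An empty missing.txt means every header in FileA has a replacement name available.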