I have two files. File A looks like this:
>MA0003.1_TFAP2A
5.4052885343e-06 5.4052885343e-06 0.999983784134 5.4052885343e-06
5.4052885343e-06 0.999983784134 5.4052885343e-06 5.4052885343e-06
5.4052885343e-06 0.999983784134 5.4052885343e-06 5.4052885343e-06
0.118921753043 0.383780891224 0.248648677866 0.248648677866
0.10270588744 0.308106851744 0.329728005881 0.259459254935
0.0486530020973 0.421617910964 0.427023199498 0.10270588744
>MA0004.1_Arnt
0.200009998 0.799890021996 4.99900019996e-05 4.99900019996e-05
0.949860027994 4.99900019996e-05 0.0500399920016 4.99900019996e-05
4.99900019996e-05 4.99900019996e-05 4.99900019996e-05 0.999850029994
4.99900019996e-05 4.99900019996e-05 0.999850029994 4.99900019996e-05
>MA0006.1_Arnt::Ahr
0.125020829862 0.333319446759 0.0833611064823 0.458298616897
4.16597233794e-05 4.16597233794e-05 0.95821529745 0.0417013831028
4.16597233794e-05 0.95821529745 4.16597233794e-05 0.0417013831028
>MA0006.1_Arntr
0.125020829862 0.333319446759 0.0833611064823 0.458298616897
4.16597233794e-05 4.16597233794e-05 0.95821529745 0.0417013831028
4.16597233794e-05 0.95821529745 4.16597233794e-05 0.0417013831028
>MA0006.1_ArntAh
0.125020829862 0.333319446759 0.0833611064823 0.458298616897
4.16597233794e-05 4.16597233794e-05 0.95821529745 0.0417013831028
4.16597233794e-05 0.95821529745 4.16597233794e-05 0.0417013831028
>MA0006.1_Arnt::A
0.125020829862 0.333319446759 0.0833611064823 0.458298616897
4.16597233794e-05 4.16597233794e-05 0.95821529745 0.0417013831028
4.16597233794e-05 0.95821529745 4.16597233794e-05 0.0417013831028
and file B, which looks like this (note that fileB also contains spaces; the last word on each line is the one that matters):
AP-2 TFAP2A
AXUD class 1 Arnt
AXU 2 Arnt::Ahr
AXU Arntr
AXU ArntAh
AXU Arnt::A
I want a third file that combines files A and B: the name in each header line of file A should be replaced as follows:
>AP-2
5.4052885343e-06 5.4052885343e-06 0.999983784134 5.4052885343e-06
5.4052885343e-06 0.999983784134 5.4052885343e-06 5.4052885343e-06
5.4052885343e-06 0.999983784134 5.4052885343e-06 5.4052885343e-06
0.118921753043 0.383780891224 0.248648677866 0.248648677866
0.10270588744 0.308106851744 0.329728005881 0.259459254935
0.0486530020973 0.421617910964 0.427023199498 0.10270588744
>AXUD class 1
0.200009998 0.799890021996 4.99900019996e-05 4.99900019996e-05
0.949860027994 4.99900019996e-05 0.0500399920016 4.99900019996e-05
4.99900019996e-05 4.99900019996e-05 4.99900019996e-05 0.999850029994
4.99900019996e-05 4.99900019996e-05 0.999850029994 4.99900019996e-05
>AXU 2
0.125020829862 0.333319446759 0.0833611064823 0.458298616897
4.16597233794e-05 4.16597233794e-05 0.95821529745 0.0417013831028
4.16597233794e-05 0.95821529745 4.16597233794e-05 0.0417013831028
What I have done:
I took file A and extracted the second part of each header name, which is separated by an underscore (_), as follows:
awk '/>/' <input_for_clustering.pwm | tr '_' '\t' | awk '{print $2}' > temp
Then I checked whether these names exist in file B and extracted the matching entries, as follows:
for i in `cat temp`
do
cat fileB | awk '{ if (($2=="'$i'")) {print $1 }}'>>data_res
done
Now the problem is how to edit file A.
Kindly help. I hope I have shown the effort and thought I have put in.
Answer 0 (score: 2)
Try this:
awk 'NR==FNR{z=$NF;$NF="";sub(/ +$/,"");a[z]=$0;next}
/^>/{split($0,b,"_");if (b[2] in a){print ">"a[b[2]]}next}1' fileB fileA
Result:
>AP-2
5.4052885343e-06 5.4052885343e-06 0.999983784134 5.4052885343e-06
5.4052885343e-06 0.999983784134 5.4052885343e-06 5.4052885343e-06
5.4052885343e-06 0.999983784134 5.4052885343e-06 5.4052885343e-06
0.118921753043 0.383780891224 0.248648677866 0.248648677866
0.10270588744 0.308106851744 0.329728005881 0.259459254935
0.0486530020973 0.421617910964 0.427023199498 0.10270588744
>AXUD class 1
0.200009998 0.799890021996 4.99900019996e-05 4.99900019996e-05
0.949860027994 4.99900019996e-05 0.0500399920016 4.99900019996e-05
4.99900019996e-05 4.99900019996e-05 4.99900019996e-05 0.999850029994
4.99900019996e-05 4.99900019996e-05 0.999850029994 4.99900019996e-05
>AXU 2
0.125020829862 0.333319446759 0.0833611064823 0.458298616897
4.16597233794e-05 4.16597233794e-05 0.95821529745 0.0417013831028
4.16597233794e-05 0.95821529745 4.16597233794e-05 0.0417013831028
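The one-liner above skips any header whose name has no entry in fileB, because `next` runs without printing. If those headers should be kept unchanged instead, a minimal sketch along the same lines could look like the following. The function name `rename_headers` is made up for illustration, and the sketch assumes the key is everything after the first underscore in the header and that fileB is not empty (the `NR == FNR` test cannot tell the two files apart otherwise).

```shell
# rename_headers MAP PWM   (hypothetical helper name)
# MAP: lines such as "AXUD class 1 Arnt" (last word = key, the rest = new name)
# PWM: motif file whose header lines look like ">MA0004.1_Arnt"
rename_headers() {
    awk '
        NR == FNR {                          # first file: build the lookup table
            if (NF >= 2) {
                key = $NF
                name = $0
                sub(/[ \t]+[^ \t]+[ \t]*$/, "", name)   # drop the last word
                map[key] = name
            }
            next
        }
        /^>/ {                               # second file: header lines
            key = $0
            sub(/^[^_]*_/, "", key)          # keep the text after the first "_"
            if (key in map) { print ">" map[key]; next }
        }
        { print }                            # matrix rows and unmatched headers
    ' "$1" "$2"
}
```

It would be called as `rename_headers fileB fileA > fileC`. Cutting the last word off a copy of the line, rather than clearing `$NF`, leaves the spacing inside the new name exactly as it appears in fileB.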
Answer 1 (score: 2)
I think this does what you want:
BEGIN { FS = "\t" }
NR==FNR { a[$2] = $1; next }
/^>/ { for (i in a) if ($0 ~ i "$") $0 = ">" a[i] }
{ print $0 }
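While the total record number equals the record number within the current file (i.e. we are in the first file), build the array a containing the replacements. next skips the rest of the script and moves on to the next line.
For lines in the second file that start with ">", loop over the keys of a, find the one that matches, and replace the line. I added an anchor $ so that the pattern must be at the end of the line. { print $0 } prints the whole line (it can be abbreviated to 1).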
Testing it out:
$ awk -f swap.awk replace file
>AP-2
5.4052885343e-06 5.4052885343e-06 0.999983784134 5.4052885343e-06
5.4052885343e-06 0.999983784134 5.4052885343e-06 5.4052885343e-06
5.4052885343e-06 0.999983784134 5.4052885343e-06 5.4052885343e-06
0.118921753043 0.383780891224 0.248648677866 0.248648677866
0.10270588744 0.308106851744 0.329728005881 0.259459254935
0.0486530020973 0.421617910964 0.427023199498 0.10270588744
>AXUD class 1
0.200009998 0.799890021996 4.99900019996e-05 4.99900019996e-05
0.949860027994 4.99900019996e-05 0.0500399920016 4.99900019996e-05
4.99900019996e-05 4.99900019996e-05 4.99900019996e-05 0.999850029994
4.99900019996e-05 4.99900019996e-05 0.999850029994 4.99900019996e-05
>AXU 2
0.125020829862 0.333319446759 0.0833611064823 0.458298616897
4.16597233794e-05 4.16597233794e-05 0.95821529745 0.0417013831028
4.16597233794e-05 0.95821529745 4.16597233794e-05 0.0417013831028
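Two properties of the script above are worth keeping in mind: `FS = "\t"` means the replacement file must be tab-separated (new name, tab, key), and `$0 ~ i "$"` treats each key as a regular expression, so a key containing `.` or another metacharacter, or a header that merely ends in the same letters, can match unintentionally. As a sketch only, the same loop can compare literal text instead, requiring the header to end in an underscore followed by the key. The function name `swap_headers` is hypothetical, and the sketch assumes that no key is itself the tail of another key after an underscore.

```shell
# swap_headers MAP PWM   (hypothetical helper name)
# MAP: tab-separated lines "new name<TAB>key"
# PWM: motif file whose header lines look like ">MA0004.1_Arnt"
swap_headers() {
    awk '
        BEGIN { FS = "\t" }
        NR == FNR { if (NF >= 2) a[$2] = $1; next }   # first file: key -> new name
        /^>/ {
            for (k in a) {
                suffix = "_" k
                start = length($0) - length(suffix) + 1
                # literal comparison: the header must end in "_" followed by the key
                if (start > 1 && substr($0, start) == suffix) {
                    $0 = ">" a[k]
                    break
                }
            }
        }
        { print }
    ' "$1" "$2"
}
```

With the files from the test run above it would be invoked as `swap_headers replace file`.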
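Answer 2 (score: 1)
Here is a Perl solution. It looks a little cryptic because it relies on a couple of regular expressions.
The strategy is to process FileB first, building a hash that is then used to translate the header strings in FileA.
All output is sent to STDOUT.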
use strict;
use warnings;
use 5.010;
use autodie;
my %fb = do {
open my ($fh), '<', 'FileB.txt';
reverse map / ( \S+ (?: \s+ \S+ )* ) \s+ (\S+) /x, <$fh>;
};
open my ($fh), '<', 'FileA.txt';
while ( <$fh> ) {
s/^>\K[^_]*_(\S+).*/$fb{$1}/;
print;
}
Output
>AP-2
5.4052885343e-06 5.4052885343e-06 0.999983784134 5.4052885343e-06
5.4052885343e-06 0.999983784134 5.4052885343e-06 5.4052885343e-06
5.4052885343e-06 0.999983784134 5.4052885343e-06 5.4052885343e-06
0.118921753043 0.383780891224 0.248648677866 0.248648677866
0.10270588744 0.308106851744 0.329728005881 0.259459254935
0.0486530020973 0.421617910964 0.427023199498 0.10270588744
>AXUD class 1
0.200009998 0.799890021996 4.99900019996e-05 4.99900019996e-05
0.949860027994 4.99900019996e-05 0.0500399920016 4.99900019996e-05
4.99900019996e-05 4.99900019996e-05 4.99900019996e-05 0.999850029994
4.99900019996e-05 4.99900019996e-05 0.999850029994 4.99900019996e-05
>AXU 2
0.125020829862 0.333319446759 0.0833611064823 0.458298616897
4.16597233794e-05 4.16597233794e-05 0.95821529745 0.0417013831028
4.16597233794e-05 0.95821529745 4.16597233794e-05 0.0417013831028