Replace text in a file with hash values using matching keys

Date: 2015-07-02 18:24:47

Tags: perl

I want to replace every word in a file that matches a hash key with the corresponding hash value.

Hash:

$VAR1 = {
    'asmbl_1'  => 'TCONS_00000046',
    'asmbl_2'  => 'TCONS_00000014',
    'asmbl_16' => 'MELO3C000012',
}

File:

CM3.6.1_CONTIG30890 assembler   transcript  187 1568    .   -   .   gene_id "PASA_cluster_1"; transcript_id "align_id:184317|asmbl_1";
CM3.6.1_CONTIG30890 assembler   exon    187 251 .   -   .   gene_id "PASA_cluster_1"; transcript_id "align_id:184317|asmbl_1";
CM3.6.1_CONTIG30898 assembler   exon    1339    2793    .   -   .   gene_id "PASA_cluster_2"; transcript_id "align_id:184318|asmbl_2";

Desired output:

CM3.6.1_CONTIG30890 assembler   transcript  187 1568    .   -   .   gene_id "PASA_cluster_1"; transcript_id "align_id:184317|TCONS_00000046";
CM3.6.1_CONTIG30890 assembler   exon    187 251 .   -   .   gene_id "PASA_cluster_1"; transcript_id "align_id:184317|TCONS_00000046";
CM3.6.1_CONTIG30898 assembler   exon    1339    2793    .   -   .   gene_id "PASA_cluster_2"; transcript_id "align_id:184318|TCONS_00000014";

I am looking for a straightforward way to do this, preferably in Perl, since the rest of my script is written in Perl.

Approaches:

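  • Read the file line by line, extract the key from each line, look that key up in the hash, and replace it with the value.
  • Iterate over the hash pair by pair; for each pair, open the file, read it line by line, and replace the matches.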

(What is the difference between these two approaches?)

  • Iterate over the hash pair by pair and call "sed -i 's/key/value/'" from bash for each pair. That is a bit ugly; I would rather do everything in Perl.
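On the difference between the first two approaches: the first makes a single pass over the file with one hash lookup per candidate token, while the second re-reads the file once per hash key. A minimal sketch of the first approach follows. It assumes every key has the form `asmbl_<digits>`, as in the sample hash; `replace_ids` is a hypothetical helper name, and the two sample lines are shortened for illustration.

```perl
use strict;
use warnings;

# Hash from the question.
my %replace = (
    'asmbl_1'  => 'TCONS_00000046',
    'asmbl_2'  => 'TCONS_00000014',
    'asmbl_16' => 'MELO3C000012',
);

# Hypothetical helper: swap every asmbl_N token that has a hash entry
# and leave tokens without an entry untouched.
# Assumption: all keys look like asmbl_<digits>.
sub replace_ids {
    my ($line) = @_;
    $line =~ s{\b(asmbl_\d+)\b}{ exists $replace{$1} ? $replace{$1} : $1 }ge;
    return $line;
}

# Shortened sample lines; a real script would loop over the file instead.
my @lines = (
    qq{transcript_id "align_id:184317|asmbl_1";\n},
    qq{transcript_id "align_id:184399|asmbl_99";\n},   # no hash entry
);

print replace_ids($_) for @lines;
```

The `exists` check matters here: without it, a token such as `asmbl_99` that is missing from the hash would be replaced by an empty string (with an "uninitialized value" warning).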

2 answers:

Answer 0: (score: 3)

A trick I like is to build one regular expression from the hash keys, then use it to capture each matching key and look up its replacement:

use strict;
use warnings;

my %replace = (
    'asmbl_1'  => 'TCONS_00000046',
    'asmbl_2'  => 'TCONS_00000014',
    'asmbl_16' => 'MELO3C000012',
);

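# Escape each key and join the keys into one alternation, longest key first.
my $search = join( "|", map {quotemeta} sort { length ($b) <=> length ($a) } keys %replace );
# Capture the matched key in $1; \b restricts matches to whole words.
$search = qr/\b($search)\b/;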

while (<>) {
    s/$search/$replace{$1}/g;
    print;
}

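Something like this produces the desired output. (The diamond operator reads from STDIN or from the files named on the command line, so it can be invoked as myscript.pl <some_File_To_process>.)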
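To illustrate what the `\b` anchors do in the pattern above, here is a small sketch that applies the same kind of regex to three bare identifiers. The `substitute` helper and the `asmbl_160` test value are hypothetical and only serve the demonstration.

```perl
use strict;
use warnings;

my %replace = (
    'asmbl_1'  => 'TCONS_00000046',
    'asmbl_16' => 'MELO3C000012',
);

# Same construction as in the answer: escaped keys, longest first.
my $alt    = join '|', map { quotemeta } sort { length($b) <=> length($a) } keys %replace;
my $search = qr/\b($alt)\b/;

# Hypothetical helper that applies the substitution to a copy of the text.
sub substitute {
    my ($text) = @_;
    $text =~ s/$search/$replace{$1}/g;
    return $text;
}

for my $id ( 'asmbl_1', 'asmbl_16', 'asmbl_160' ) {
    print "$id -> ", substitute($id), "\n";
}
# asmbl_1 -> TCONS_00000046
# asmbl_16 -> MELO3C000012
# asmbl_160 -> asmbl_160
```

Because a digit is a word character, there is no word boundary between `asmbl_16` and a following `0`, so `asmbl_160` is left alone rather than being partially rewritten.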

Answer 1: (score: 3)

This is all that is necessary:

use strict;
use warnings;

my %map = (
    asmbl_1  => 'TCONS_00000046',
    asmbl_2  => 'TCONS_00000014',
    asmbl_16 => 'MELO3C000012',
);

my $re = join '|', map quotemeta, keys %map;

while ( <DATA> ) {
    s/\b($re)\b/$map{$1}/g;
    print;
}

__DATA__
CM3.6.1_CONTIG30890 assembler   transcript  187 1568    .   -   .   gene_id "PASA_cluster_1"; transcript_id "align_id:184317|asmbl_1";
CM3.6.1_CONTIG30890 assembler   exon    187 251 .   -   .   gene_id "PASA_cluster_1"; transcript_id "align_id:184317|asmbl_1";
CM3.6.1_CONTIG30898 assembler   exon    1339    2793    .   -   .   gene_id "PASA_cluster_2"; transcript_id "align_id:184318|asmbl_2";

Output

CM3.6.1_CONTIG30890 assembler   transcript  187 1568    .   -   .   gene_id "PASA_cluster_1"; transcript_id "align_id:184317|TCONS_00000046";
CM3.6.1_CONTIG30890 assembler   exon    187 251 .   -   .   gene_id "PASA_cluster_1"; transcript_id "align_id:184317|TCONS_00000046";
CM3.6.1_CONTIG30898 assembler   exon    1339    2793    .   -   .   gene_id "PASA_cluster_2"; transcript_id "align_id:184318|TCONS_00000014";