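Combine files based on a column

Date: 2015-07-06 13:26:31

Tags: python perl awk

I want to merge 100 files on the mir_seq column. The output should be a single file containing mir_seq plus the freq column from each of the original files.

The files look like this:

File 1: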

 mir_seq                                    seq                      name                   freq    mir start   end mism    add t5  t3  s5  s3  DB  ambiguity
hsa-miR-143-3p_TGAGAAGAAGCACTGTAGCTCTT  TGAGAAGAAGCACTGTAGCTCTT seq_100006_x0     0 hsa-miR-143-3p  61  81  6AT u-TT    0   0   AGTCTGAG    GCTCAGGA    miRNA   1
hsa-miR-10a-5p_GACCCTGTAGATCCGAATTTGTA  GACCCTGTAGATCCGAATTTGTA seq_100012_x1   1   hsa-miR-10a-5p  22  43  1GT u-A 0   u-G TATATACC    TGTGTAAG    miRNA   1
hsa-miR-10a-5p_GACCCTGTAGATCCGAATTTGTG  GACCCTGTAGATCCGAATTTGTG seq_100013_x54  54  hsa-miR-10a-5p  22  44  1GT 0   0   0   TATATACC    TGTGTAAG    miRNA   1

File 2:

mir_seq                                  seq    name    freq    mir start   end mism    add t5  t3  s5  s3  DB    ambiguity
hsa-miR-143-3p_TGAGAAGAAGCACTGTAGCTCTT  TGAGAAGAAGCACTGTAGCTCTT seq_100006_x1   1   hsa-miR-143-3p  61  81  6AT u-TT    0   0   AGTCTGAG    GCTCAGGA    miRNA   1
hsa-miR-10a-5p_GACCCTGTAGATCCGAATTTGTA  GACCCTGTAGATCCGAATTTGTA seq_100012_x0   0   hsa-miR-10a-5p  22  43  1GT u-A 0   u-G TATATACC    TGTGTAAG    miRNA   1
hsa-miR-10a-5p_GACCCTGTAGATCCGAATTTGTG  GACCCTGTAGATCCGAATTTGTG seq_100013_x24  24  hsa-miR-10a-5p  22  44  1GT 0   0   0   TATATACC    TGTGTAAG    miRNA   1
hsa-miR-1296-5p_TTAGGGCCCTGGCTCCATCT    TTAGGGCCCTGGCTCCATCT    seq_100019_x17  17  hsa-miR-1296-5p 16  35  0   0   0   u-CC    TGGGTTAG    CTCCTTTA    miRNA   1

The files are named like this; only the part between _ and .txt.mirna differs, and the files are tab-delimited:

Miraligner_94G.txt.mirna
Miraligner_944G.txt.mirna
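
As a small illustration of that naming scheme, here is a minimal Python sketch that pulls out the variable part of a file name. The helper name sample_id and the exact pattern are assumptions made for this example, based only on the two file names shown above.

```python
import re

# Hypothetical pattern: assumes every name looks like Miraligner_<ID>.txt.mirna.
_NAME_RE = re.compile(r"^Miraligner_(.+)\.txt\.mirna$")


def sample_id(filename):
    """Return the part between 'Miraligner_' and '.txt.mirna', or None if the name does not match."""
    match = _NAME_RE.match(filename)
    return match.group(1) if match else None


print(sample_id("Miraligner_94G.txt.mirna"))   # 94G
print(sample_id("Miraligner_944G.txt.mirna"))  # 944G
```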

The output file should look like this:

mir_seq                                  freq_94G     freq_944G     freq_912R
hsa-miR-143-3p_TGAGAAGAAGCACTGTAGCTCTT   0            12            55

2 answers:

Answer 0 (score: 2)

You only provided one sample input file, so this is obviously untested, since you can't test a "merge" with just one file:

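awk '
BEGIN { OFS = "\t" }                  # write tab-separated output
FNR==1 {
    # Miraligner_94G.txt.mirna -> tmp[1]="Miraligner", tmp[2]="94G"
    split(FILENAME,tmp,/[_.]/)
    sfx = tmp[2]
    sfxs[sfx]                         # remember every sample suffix
    next                              # skip the header line of each file
}
{
    keys[$1]                          # every mir_seq seen in any file
    val[$1,sfx] = $4                  # freq is the 4th column
}
END {
    # header row: mir_seq followed by one freq_<suffix> column per file
    printf "mir_seq"
    for (sfx in sfxs) {
        printf "%sfreq_%s", OFS, sfx
    }
    print ""

    # one row per mir_seq; %d prints 0 where a file had no entry
    for (key in keys) {
        printf "%s", key
        for (sfx in sfxs) {
            printf "%s%d", OFS, val[key,sfx]
        }
        print ""
    }
}
' Miraligner_*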
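
Since the question is also tagged python, the same idea can be sketched in Python. This is only a sketch under stated assumptions, not a tested solution for the real data: the function name merge_freq and its tables argument (a dict mapping a sample label to that file's lines) are made up for this example, and fields are split on whitespace as awk does by default, which assumes no field is empty or contains a space.

```python
def merge_freq(tables, missing="0"):
    """Merge the freq column of several tables on mir_seq.

    tables: dict mapping a sample label (e.g. "94G") to an iterable of lines,
            header line first. Hypothetical interface for this sketch.
    Returns the merged table as a list of tab-separated lines.
    """
    labels = list(tables)  # column order follows the dict's insertion order
    merged = {}            # mir_seq -> {label: freq}
    for label, lines in tables.items():
        lines = iter(lines)
        header = next(lines).split()
        key_col = header.index("mir_seq")
        freq_col = header.index("freq")
        for line in lines:
            fields = line.split()
            if not fields:  # ignore blank lines
                continue
            merged.setdefault(fields[key_col], {})[label] = fields[freq_col]

    out = ["\t".join(["mir_seq"] + ["freq_" + label for label in labels])]
    for key, freqs in merged.items():
        # a mir_seq that is absent from a file gets the placeholder value
        out.append("\t".join([key] + [freqs.get(label, missing) for label in labels]))
    return out
```

With real files, tables could be built by opening each Miraligner_* file and keying it by the part of its name between _ and .txt.mirna; the result can then be written out with "\n".join(...).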

Answer 1 (score: 2)

OK, given that you are working with the files:

Miraligner_94G.txt.mirna
Miraligner_944G.txt.mirna

it looks like you are just picking one column out of each.

So:

#!/usr/bin/env perl
use strict;
use warnings;

my %data;
my %seen;

foreach my $file ( glob("Miraligner_*") ) {
    # Capture the sample ID between "_" and ".txt"; the dot is escaped so it matches literally.
    my ($freq_id) = ( $file =~ m/_(\w+)\.txt/ );
    $freq_id = "freq_$freq_id";
    $seen{$freq_id}++;
    open( my $input, "<", $file ) or die $!;
    my @headers = split( ' ', <$input> );
    while (<$input>) {
        my %line;
        @line{@headers} = split;
        my $key = $line{'mir_seq'};
        $data{$key}{$freq_id} = $line{'freq'};
    }
    close($input);
}

my @cols = sort keys %seen;
print join( "\t", "mir_seq", @cols ), "\n";
foreach my $mir_seq ( sort keys %data ) {
    my @output_cols = map { $_ // 0 } @{ $data{$mir_seq} }{@cols};
    print join( "\t", $mir_seq, @output_cols ), "\n";
}

Given your data set, this outputs (tab-separated):

mir_seq freq_944G   freq_94G
hsa-miR-10a-5p_GACCCTGTAGATCCGAATTTGTA  1   0
hsa-miR-10a-5p_GACCCTGTAGATCCGAATTTGTG  54  24
hsa-miR-1296-5p_TTAGGGCCCTGGCTCCATCT    0   17
hsa-miR-143-3p_TGAGAAGAAGCACTGTAGCTCTT  0   1

Note: if a value is undefined, this currently prints zero. If you want to print something else, modify the map in the output loop.

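It also sorts both the columns and the rows alphabetically, which may not be what you want either, but there are plenty of sorting examples you can refer to.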
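
To illustrate why alphabetical order may surprise you here: as plain strings, freq_944G sorts before freq_94G. One common alternative is a "natural" sort that compares runs of digits by numeric value. A minimal Python sketch follows; the helper name natural_key is an assumption of this example, not part of the script above.

```python
import re


def natural_key(text):
    """Split text into digit and non-digit runs so that numbers compare by value."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", text)]


cols = ["freq_944G", "freq_94G", "freq_912R"]
print(sorted(cols))                   # ['freq_912R', 'freq_944G', 'freq_94G']
print(sorted(cols, key=natural_key))  # ['freq_94G', 'freq_912R', 'freq_944G']
```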