Question

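I have hundreds of files, each with a different number of entries (>xxxx), and I want to keep only the entries shared by all files, in each file separately. I'm not sure what the best way to do this is, maybe Perl! I tried sort and uniq in bash, but I didn't get the right answer. In all files, each ID starts with > followed by 4 characters.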

1.fa

>abcd
CTGAATGCC

2.fa

>abcd
AAATGCGCG
>efgh
CGTAC

3.fa

>abcd
ATGCAATA
>efgh
TAACGTAA
>ijkl
TGCAA

The final result for this example would be:

1.fa

>abcd
CTGAATGCC

2.fa

>abcd
AAATGCGCG

3.fa

>abcd
ATGCAATA

Answer 1

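This Perl program will do as you ask. It uses Perl's built-in in-place edit feature and renames the original files to 1.fa.bak etc. Blank lines in your data shouldn't be a problem as long as the sequence is always on the single line after the ID.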

use strict;
use warnings 'all';

my @files = glob '*.fa';

printf "Processing %d file%s\n", scalar @files, @files == 1 ? "" : "s";

exit if @files < 2;

my %ids;

{
    local @ARGV = @files;

    while ( <> ) {
        ++$ids{$1} if /^>(\S+)/;
    }
}

# remove keys that aren't in all files
delete @ids{ grep { $ids{$_} < @files } keys %ids };
my $n = keys %ids;
printf "%d ID%s common to all files\n", $n, $n == 1 ? '' : "s";

exit unless $n;

{
    local @ARGV = @files;
    local $^I = '.bak';

    while ( <> ) {

        next unless /^>(\S+)/ and $ids{$1};

        print;
        print scalar <>;
    }
}
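
One caveat with the counting pass above: it tallies every header line, so an ID that occurred twice in the same file would be counted twice. If that can happen in your data, a variant that counts each ID at most once per file might look like this. It is only a sketch; the helper name `common_ids` is my own and not part of the program above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref whose keys are the IDs present
# in every one of the given files. Each ID is counted at most once per
# file, so duplicates inside a file cannot inflate the tally.
sub common_ids {
    my @files = @_;
    my %count;

    for my $file (@files) {
        open my $fh, '<', $file or die "Could not open '$file': $!";
        my %seen;    # IDs already counted for this file
        while ( my $line = <$fh> ) {
            next unless $line =~ /^>(\S+)/;
            $count{$1}++ unless $seen{$1}++;
        }
        close $fh;
    }

    # keep only the IDs that were seen in all files
    my %common = map { $_ => 1 } grep { $count{$_} == @files } keys %count;
    return \%common;
}
```

Calling `common_ids( glob '*.fa' )` would then give a lookup hash that can be used in place of `%ids` in the filtering pass.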

Answer 2

Here is a Perl solution that may help you:

use feature qw(say);
use strict;
use warnings;

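my $file_dir = 'files';
# stop early if the directory is missing, rather than globbing the wrong place
chdir $file_dir or die "Could not change to directory '$file_dir': $!";
my @files = <*.fa>;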

my $num_files = scalar @files;
my %ids;
for my $file (@files) {
    open ( my $fh, '<', $file) or die "Could not open file '$file': $!";
    while (my $id = <$fh>) {
        chomp $id;
        chomp (my $sequence = <$fh>);
        $ids{$id}++;
    }
    close $fh;
}

for my $file (@files) {
    open ( my $fh, '<', $file) or die "Could not open file '$file': $!";
    my $new_name = $file . '.new';
    open ( my $fh_write, '>', $new_name ) or die "Could not open file '$new_name': $!";
    while (my $id = <$fh>) {
        chomp $id;
        chomp (my $sequence = <$fh>);
        if ( $ids{$id} == $num_files ) {
            say $fh_write $id;
            say $fh_write $sequence;
        }
    }
    close $fh_write;
    close $fh;
}

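It assumes that all the .fa files are in the directory named by $file_dir, and writes the new sequences to new files in the same directory. The new file names get a .new extension.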
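
Note that the loops above read the file strictly two lines at a time, so a blank line or a sequence wrapped over several lines would throw the pairing off. If your files are not that regular, a more tolerant reader could be sketched roughly as follows. The name `parse_fasta` is hypothetical, and I am assuming that every header starts with > at the beginning of a line.

```perl
use strict;
use warnings;

# Hypothetical helper: read FASTA text from a filehandle and return a
# list of [ header, sequence ] pairs. Blank lines are skipped and
# sequence lines that wrap over several lines are joined together.
sub parse_fasta {
    my ($fh) = @_;
    my @records;

    while ( my $line = <$fh> ) {
        chomp $line;
        next if $line =~ /^\s*$/;            # ignore blank lines
        if ( $line =~ /^>/ ) {
            push @records, [ $line, '' ];    # start a new record
        }
        elsif (@records) {
            $records[-1][1] .= $line;        # append to the current sequence
        }
    }
    return @records;
}
```

Each pair returned by `parse_fasta($fh)` could then stand in for the `$id` and `$sequence` variables used in the two loops.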

Keeping shared entries in many files