Question

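I have two files, target and clean.

Target has 1,055,772 lines, each with 3,000 tab-separated columns. (Size: 7.5G)
Clean is slightly shorter, at 806,535 lines. Clean has only one column, which matches the format of the first column of Target. (Size: 13M)

I want to extract the lines of target whose first column matches a line in clean.

I wrote a grep-based loop to do this, but it is painfully slow. Speedups will be rewarded with upvotes and/or smileys.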

clean="/path/to/clean"
target="/path/to/target"
oFile="/output/file"

# keep the header line, then run one grep over the whole target per clean ID
head -1 "$target" > "$oFile"
cat "$clean" | while read snp; do
    echo "$snp"
    grep "$snp" "$target" >> "$oFile"
done

$ head $clean
1_111_A_G
1_123_T_A
1_456_A_G
1_7892_C_G

Edit: I wrote a simple Python script to do this.

clean_variants_file = "/scratch2/vyp-scratch2/cian/UCLex_August2014/clean_variants"
allChr_file = "/scratch2/vyp-scratch2/cian/UCLex_August2014/allChr_snpStats"
outfile = open("/scratch2/vyp-scratch2/cian/UCLex_August2014/results.tab", "w")

# load the clean IDs as dict keys for constant-time lookups
clean_variant_dict = {}
for line in open(clean_variants_file):
    clean_variant_dict[line.strip()] = 0

# write every target line whose first column is a clean ID
for line in open(allChr_file):
    ll = line.strip().split("\t")
    id_ = ll[0]
    if id_ in clean_variant_dict:
        outfile.write(line)

outfile.close()

Answer 1

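This Perl solution uses quite a lot of memory (because it loads the entire clean file into memory), but it saves you from looping over the target once per ID: each file is read exactly once. It uses a hash for the membership check, with each line of clean stored as a key. Note that this code has not been thoroughly tested, but it seems to work on a limited set of data.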

use strict;
use warnings;

my ($clean, $target) = @ARGV;

open my $fh, "<", $clean or die "Cannot open file '$clean': $!";

my %seen;
while (<$fh>) {
    chomp;
    $seen{$_}++;
}

open $fh, "<", $target 
        or die "Cannot open file '$target': $!";    # reuse file handle

while (<$fh>) {
    my ($first) = /^([^\t]*)/;
    print if $seen{$first};
}

If the target file is proper tab-delimited CSV data, you could use Text::CSV_XS, which is reportedly very fast.
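For readers who do not use Perl, the same idea can be sketched in Python. This is not Text::CSV_XS; it swaps in Python's standard csv module with a tab delimiter, and the function name filter_rows is made up for this illustration. Parsing all 3,000 columns of every row is probably slower than just cutting off the first field, so treat it as a sketch for data that really needs CSV quoting rules.

```python
import csv
import io

def filter_rows(clean_lines, target_file, out_file):
    """Copy the rows of target_file whose first field appears in clean_lines."""
    # one key per clean line, without its line ending
    keys = {line.rstrip("\r\n") for line in clean_lines}
    keys.discard("")  # ignore blank lines in the clean list
    reader = csv.reader(target_file, delimiter="\t")
    writer = csv.writer(out_file, delimiter="\t", lineterminator="\n")
    for row in reader:
        if row and row[0] in keys:
            writer.writerow(row)

# small in-memory demonstration; real files would be opened with newline=""
clean = io.StringIO("1_111_A_G\n1_456_A_G\n")
target = io.StringIO("1_111_A_G\t0.1\n1_123_T_A\t0.2\n1_456_A_G\t0.3\n")
result = io.StringIO()
filter_rows(clean, target, result)
print(result.getvalue(), end="")
```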

Answer 2

A Python solution:

with open('/path/to/clean', 'r') as fin:
    keys = set(fin.read().splitlines())

with open('/path/to/target', 'r') as fin, open('/output/file', 'w') as fout:
    for line in fin:
        if line[:line.index('\t')] in keys:
            fout.write(line)
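One caveat with the snippet above: line.index('\t') raises ValueError on any line that contains no tab, and the header line that the question copies with head -1 is not carried over. A possible variant, assuming that behavior is wanted, uses str.partition instead. The function name filter_by_first_column and the keep_header parameter are invented for this sketch.

```python
def filter_by_first_column(clean_lines, target_lines, keep_header=True):
    """Yield the target lines whose first tab-separated field is a clean ID."""
    keys = {line.rstrip("\r\n") for line in clean_lines}
    for i, line in enumerate(target_lines):
        # partition never raises: without a tab, `first` is the whole line
        first, _sep, _rest = line.partition("\t")
        if (keep_header and i == 0) or first.rstrip("\r\n") in keys:
            yield line

# Usage with real files (paths as in the question):
# with open('/path/to/clean') as c, open('/path/to/target') as t, \
#         open('/output/file', 'w') as out:
#     out.writelines(filter_by_first_column(c, t))
```

Because the function is a generator, the 7.5G target is still streamed line by line; only the clean IDs are held in memory.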

Answer 3

Using a perl one-liner:

perl -F'\t' -lane '
    BEGIN{ local @ARGV = pop; @s{<>} = () }
    print if exists $s{"$F[0]\n"}
  ' target clean

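Switches:

-F: split() pattern for the -a switch (here a tab)
-l: enables line-ending processing (input lines are chomped and print appends a newline)
-a: splits each line (on whitespace by default, here on the -F pattern) and loads the fields into the array @F
-n: creates a while(<>){...} loop over each "line" of the input file
-e: tells perl to execute the code given on the command line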
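To make the effect of these switches concrete, here is a rough Python rendering of what the one-liner does step by step. It is only an illustration of the control flow, not a replacement; the function name one_liner_equivalent is hypothetical. Note how the keys keep their trailing newline, which is why the lookup appends "\n" to the first field.

```python
import io

def one_liner_equivalent(clean_file, target_file):
    """Mimic the perl -F'\\t' -lane one-liner on two open text files."""
    # BEGIN block: @s{<>} = () stores every clean line, newline included, as a key
    s = set(clean_file)
    selected = []
    # -n: implicit loop over the remaining input (the target file)
    for line in target_file:
        line = line.rstrip("\n")          # -l: strip the line ending on input
        fields = line.split("\t")         # -a with -F'\t': fields go into @F
        if fields[0] + "\n" in s:         # exists $s{"$F[0]\n"}
            selected.append(line + "\n")  # -l: print adds the line ending back
    return selected

clean = io.StringIO("1_111_A_G\n1_456_A_G\n")
target = io.StringIO("1_111_A_G\ta\n1_123_T_A\tb\n1_456_A_G\tc\n")
print("".join(one_liner_equivalent(clean, target)), end="")
```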

Or as a perl script:

use strict;
use warnings;

die "Usage: $0 target clean\n" if @ARGV != 2;

my %s = do {
    local @ARGV = pop;
    map {$_ => 1} (<>)
};

while (<>) {
    my ($f) = split /\t/;
    print if $s{"$f\n"}
}

Answer 4

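Just for fun, I thought I would convert a solution or two into Perl6.

Note: these will probably be slower than the originals until Rakudo / NQP gets more optimizations, work on which had only really started in earnest around the time of posting.

First, TLP's Perl5 answer, converted nearly one-to-one into Perl6.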

#! /usr/bin/env perl6
# I have a link named perl6 aliased to Rakudo on MoarVM-jit

use v6;

multi sub MAIN ( Str $clean, Str $target ){ # same as the Perl5 version
    MAIN( :$clean, :$target ); # call the named version
}

multi sub MAIN ( Str :$clean!, Str :$target! ){ # using named arguments

    note "Processing clean file";

    my %seen := SetHash.new;

    for open( $clean, :r ).lines -> $line {
        next unless $line.chars; # skip empty lines
        %seen{$line}++;
    }

    note "Processing target file";

    for open( $target, :r ).lines -> $line {
        $line ~~ /^ $<first> = <-[\t]>+ /;
        say $line if %seen{$<first>.Str};
    }
}

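I used a MAIN subroutine so that you get a Usage message if you don't give it the correct arguments.
I also used a SetHash instead of a regular Hash to reduce memory use, since we don't need to know how many times each key was seen, only that it was seen.

Next, I tried combining all of the lines in the clean file into a single regex.

This is similar to the sed and grep answer from Cyrus, except that instead of many regexes there is only one.

I didn't want to change the subroutine I had already written, so I added another one, which is selected by adding --single-regex or -s to the command line. (All of the examples are in the same file.)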

multi sub MAIN ( Str :$clean!, Str :$target!, Bool :single-regex(:s($))! ){

    note "Processing clean file";

    my $regex;
    {
        my @regex = open( $clean, :r ).lines.grep(*.chars);
        $regex = /^ [ | @regex ] /;
    } # throw away @regex

    note "Processing target file";

    for open( $target, :r ).lines -> $line {
        say $line if $line ~~ $regex;
    }
}
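The single-regex idea can also be sketched in Python with the standard re module. This is an illustration under assumptions, not a translation of the Perl6 code: the function name build_pattern is invented, each key is escaped so it is matched literally, and the trailing lookahead is an addition of this sketch so that a key cannot match just the prefix of a longer ID. With roughly 800,000 alternatives this is probably much slower in CPython than a set lookup, so it is shown for comparison only.

```python
import re

def build_pattern(clean_lines):
    """Compile one regex that matches any clean ID as a line's first column."""
    keys = [line.rstrip("\r\n") for line in clean_lines]
    keys = [k for k in keys if k]  # skip empty lines
    if not keys:
        raise ValueError("no keys found in the clean list")
    alternation = "|".join(re.escape(k) for k in keys)
    # the key must be followed by a tab or by the end of the line
    return re.compile(r"(?:%s)(?=\t|$)" % alternation)

pattern = build_pattern(["1_111_A_G\n", "1_456_A_G\n"])
for line in ["1_111_A_G\t0.1\n", "1_123_T_A\t0.2\n"]:
    # match() only tries the start of the line, like the ^ anchor above
    if pattern.match(line):
        print(line, end="")
```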

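I will say that it took me quite a bit longer to write this than it would have taken me to write it in Perl5. Most of the time was spent searching online for idioms and looking through the Rakudo source files. I don't think it would take much effort to become better at Perl6 than at Perl5.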

More efficient way to extract lines from a file whose first column matches another file

4 answers: