Question

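I want to use grep with -f on a text file to match a long list (10,000) of patterns. It turns out grep doesn't like this (who knew?). After a day it had produced nothing. Smaller lists finish almost instantly.

I was thinking I might split my long list up and run it several times. Any idea what a good maximum length for the pattern list would be?

Also, I'm fairly new to unix. Alternative approaches are welcome. The list of patterns, or search terms, is in a plain text file, one per line.

Thank you everyone for your guidance.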

Answer 1

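From the comments, it appears that the patterns you are matching are fixed strings. If that is the case, you should definitely use -F. That will speed up the matching considerably. (Matching 479,000 strings against an input file of 3 lines with -F takes under 1.5 seconds on a moderately powered machine. Without -F, the same machine had still not finished after several minutes.)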
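
As a rough illustration of what -F changes, the sketch below builds two tiny throwaway files (the file names and contents are made up for the example) and runs the same search with and without -F. With -F each line of the pattern file is taken as a literal string; without it each line is a regular expression.

```shell
# Hypothetical demo files; in practice these would be your own pattern list and text file.
tmpdir=$(mktemp -d)
printf '%s\n' 'a.c' 'foo' > "$tmpdir/patterns.txt"
printf '%s\n' 'abc' 'a.c' 'foobar' 'nothing here' > "$tmpdir/text.txt"

# -F: treat every line of the pattern file as a fixed string, not a regex.
fixed_matches=$(grep -F -f "$tmpdir/patterns.txt" "$tmpdir/text.txt")

# Without -F the pattern a.c is a regex (. matches any character),
# so it also matches the line abc.
regex_matches=$(grep -f "$tmpdir/patterns.txt" "$tmpdir/text.txt")

echo "with -F:"
echo "$fixed_matches"
echo "without -F:"
echo "$regex_matches"
rm -r "$tmpdir"
```

Besides being faster, -F also avoids surprises when the search terms contain characters such as . or * that have a special meaning in regular expressions.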

Answer 2

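I ran into the same problem: searching for 4 million patterns in a file with 9 million lines. It seems to be a RAM problem. So I came up with this neat little workaround, which may be slower than splitting and joining, but it only needs this one line.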

 while IFS= read -r line; do grep -- "$line" fileToSearchIn; done < patternFile

I needed this workaround because the -F flag was no solution for files that large...
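
If the whole pattern list does not fit in memory, a middle ground between one huge -f run and one grep per pattern is to feed grep the list in chunks. The following is only a sketch under assumptions not stated above: the patterns are fixed strings, the chunk size of 2 is purely for the demo, and the file contents are placeholders.

```shell
tmpdir=$(mktemp -d)
# Placeholder data standing in for patternFile and fileToSearchIn.
printf '%s\n' 'id1' 'id3' 'id5' > "$tmpdir/patternFile"
printf '%s\n' 'id1 first' 'id2 second' 'id3 third' 'id4 fourth' 'id5 fifth' > "$tmpdir/fileToSearchIn"

# Split the pattern list into pieces of 2 lines each
# (use a much larger number in practice).
split -l 2 "$tmpdir/patternFile" "$tmpdir/chunk."

# One grep pass per chunk. -n prefixes each hit with its line number, so
# sort -u can drop lines matched by more than one chunk and restore the
# original order; cut then removes the line-number prefix again.
matches=$(
  for chunk in "$tmpdir"/chunk.*; do
    grep -n -F -f "$chunk" "$tmpdir/fileToSearchIn"
  done | sort -t: -k1,1n -u | cut -d: -f2-
)
echo "$matches"
rm -r "$tmpdir"
```

This scans the file once per chunk rather than once per pattern, so the chunk size trades memory against the number of passes.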

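Edit: This turns out to be slow for big files. After some more research I found 'faSomeRecords' and other really great tools from Kent NGS-editing-Tools.

I tried it myself by extracting 2 million fasta-rec from a file of 5.5 million records. It took approx. 30 sec.

Cheers

Edit: direct download link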

Answer 3

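Here is a bash script that you can run on your files (or, if you prefer, on a subset of your files). It splits the key file into increasingly large blocks and attempts the grep operation for each block. The operations are timed: right now I am timing each grep operation, as well as the total time to process all the sub-expressions. Output is in seconds; with some effort you could get ms, but with the problem you are having it is unlikely that you need that granularity. Run the script in a terminal window with a command of the form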

./timeScript keyFile textFile 100 > outputFile

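This runs the script using keyFile as the file where the search keys are stored, textFile as the file in which to look for the keys, and 100 as the initial block size. On each loop, the block size is doubled.

In a second terminal, run the command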

tail -f outputFile

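which follows the output that the other process writes to the file outputFile.

I suggest that you open a third terminal window and run top in it. You will be able to see how much memory and CPU your process is consuming; again, if you see large amounts of memory being consumed, that is a hint that things are not going well.

This should let you find out when things start to slow down, which is the answer to your question. I don't think there is a "magic number"; it probably depends on your machine, and in particular on the file size and the amount of memory you have.

You can take the output of the script and pipe it through grep: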

grep entire outputFile

You will end up with just the summaries, that is, the block size and the time taken, e.g.:

Time for processing entire file with blocksize 800: 4 seconds

If you plot these numbers against each other (or just inspect them), you will see when the algorithm is optimal and when it slows down.
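
To get the numbers into a form that is easy to plot or eyeball, the summary lines can be reduced to two columns. A small sketch, assuming the line format shown above (the sample lines other than the blocksize 800 one are invented for the demo):

```shell
tmpdir=$(mktemp -d)
# Sample stand-in for outputFile; only the line format matters here.
cat > "$tmpdir/outputFile" <<'EOF'
Using blocks of 800
Time taken: 2 s
Time for processing entire file with blocksize 800: 4 seconds
Time for processing entire file with blocksize 1600: 9 seconds
EOF

# In a summary line, field 8 is the block size followed by a colon
# and field 9 is the elapsed time in seconds.
summary=$(grep entire "$tmpdir/outputFile" | awk '{ sub(/:$/, "", $8); print $8, $9 }')
echo "$summary"
rm -r "$tmpdir"
```

Each output line is then a "blocksize seconds" pair that can be pasted into a spreadsheet or a plotting tool.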

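Here is the code. I did not do extensive error checking, but it seems to work for me. Obviously, in your final solution you need to do something with the output of grep (instead of piping it to wc -l, which I did just to see how many lines were matched)...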

#!/bin/bash
# script to look at difference in timing
# when grepping a file with a large number of expressions
# assume first argument = name of file with list of expressions
# second argument = name of file to check
# optional third argument = initial block size (default 100)
#
# split f1 into chunks of N, 2N, 4N, ... expressions at a time (N = initial block size)
# and print out how long it took to process all the lines in f2

if (($# < 2 )); then
  echo "Warning: need at least two parameters."
  echo "Usage: timeScript keyFile searchFile [initial blocksize]"
  exit 1
fi

f1_linecount=$(wc -l < "$1")
echo linecount of file1 is $f1_linecount

f2_linecount=$(wc -l < "$2")
echo linecount of file2 is $f2_linecount
echo

if (($# < 3 )); then
  blockLength=100
else
  blockLength=$3
fi

while ((blockLength < f1_linecount))
do
  echo "Using blocks of $blockLength"
  # split is a standard utility that splits a file into pieces
  # -l tells it to start a new piece every $blockLength lines
  # and the block$blockLength parameter is a prefix for the piece names
  split -l "$blockLength" "$1" "block$blockLength"
  Tstart="$(date +%s)"
  Tbefore=$Tstart

  for fn in block*
    do
      echo "grep -f $fn $2 | wc -l"
      echo number of lines matched: $(grep -f "$fn" "$2" | wc -l)
      Tnow="$(date +%s)"
      echo "Time taken: $((Tnow - Tbefore)) s"
      Tbefore=$Tnow
    done
  echo "Time for processing entire file with blocksize $blockLength: $((Tnow - Tstart)) seconds"
  blockLength=$((2 * blockLength))
  # remove the split files - no longer needed
  # (note: this deletes every file in the current directory whose name starts with "block")
  rm block*
  echo "block length is now $blockLength and f1 linecount is $f1_linecount"
done

exit 0

Answer 4

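You can certainly try it and see whether you get better results, but either way it is a lot of work to do on a file of any size. You haven't given any details about your problem, but if you have 10k patterns, I would think about whether there is some way to generalize them into a smaller number of regular expressions.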
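
As a purely hypothetical illustration of that idea: if the list happened to consist of identifiers that share a shape, say sample_000 through sample_999 at the start of a line, a single regular expression could stand in for all thousand entries. The identifier format and the file contents below are invented for the example.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 'sample_007 ok' 'sample_12 too short' 'other_123 no' 'sample_999 ok' > "$tmpdir/text.txt"

# One extended regex instead of 1000 literal patterns:
# "sample_" followed by exactly three digits and then a space.
generalized=$(grep -E '^sample_[0-9]{3} ' "$tmpdir/text.txt")
echo "$generalized"
rm -r "$tmpdir"
```

This only helps when the patterns really do follow a rule; an arbitrary list of unrelated strings cannot be compressed this way.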

Answer 5

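Here is a perl script "match_many.pl" that addresses a very common subset of the "large number of keys vs. large number of records" problem. Keys are read one per line from stdin. The two command line parameters are the name of the file to search and the field (whitespace delimited) that must match a key. This subset of the original problem can be solved quickly because the location of the match (if any) within the record is known ahead of time and the key always corresponds to an entire field in the record. In one typical case it searched 9400265 records with 42899 keys, matching 42401 of the keys and emitting 1831944 records in 41s. The more general case, where the key may appear as a substring anywhere in a record, is a harder problem that this script does not address. (If keys never contain whitespace and always correspond to an entire word, the script could be modified to handle that case by iterating over all fields of each record instead of testing just the one, at the cost of running M times slower, where M is the average field number at which the matches are found.)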

#!/usr/bin/perl -w
use strict;
use warnings;
my $kcount;
my ($infile,$test_field) = @ARGV;
if(!defined($infile) || "$infile" eq "" || !defined($test_field) || ($test_field <= 0)){
  die "syntax: match_many.pl infile field" 
}
my %keys;       # hash of keys
$test_field--;  # external range (1,N) to internal range (0,N-1)

$kcount=0;
while(<STDIN>) {
   my $line = $_;
   chomp($line);
   $keys {$line} = 1;
   $kcount++
}
print STDERR "keys read: $kcount\n";

my $records = 0;
my $emitted = 0;
open(INFILE, '<', $infile) or die "Could not open $infile: $!";
while(<INFILE>) {
   if(substr($_,0,1) =~ /#/){ #skip comment lines
     next;
   }
   my $line = $_;
   chomp($line);
   $line =~ s/^\s+//;
   my @fields = split(/\s+/, $line);
   if(defined($fields[$test_field]) && exists($keys{$fields[$test_field]})){
      print STDOUT "$line\n";
      $emitted++;
      $keys{$fields[$test_field]}++;
   }
   $records++;
}

$kcount=0;
while( my( $key, $value ) = each %keys ){
   if($value > 1){ 
      $kcount++; 
   }
}

close(INFILE);
print STDERR "records read: $records, emitted: $emitted; keys matched: $kcount\n";

exit;
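
For comparison, the same hash-lookup idea can be sketched as an awk one-liner run from the shell. This is not the script above, just a minimal equivalent of its core loop under the same assumptions (whitespace-separated fields, keys equal to a whole field). The file names and the field number are placeholders, the keys come from a file rather than stdin, and it neither skips comment lines nor prints the statistics.

```shell
tmpdir=$(mktemp -d)
# Placeholder key list and records; field 2 is the one that must match a key.
printf '%s\n' 'k1' 'k3' > "$tmpdir/keys.txt"
printf '%s\n' 'a k1 x' 'b k2 y' 'c k3 z' 'k1 d w' > "$tmpdir/records.txt"

# While reading the first file (NR == FNR), store each key in a hash.
# For the second file, print records whose field number f is a stored key,
# which costs one hash lookup per record regardless of the number of keys.
emitted=$(awk -v f=2 'NR == FNR { keys[$0]; next } ($f in keys)' \
  "$tmpdir/keys.txt" "$tmpdir/records.txt")
echo "$emitted"
rm -r "$tmpdir"
```

The last record is not printed even though it contains k1, because the key sits in field 1 rather than field 2, which is the same whole-field restriction the perl script has.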

grep -f maximum number of patterns?