Question

I have two files, file1.txt and file2.txt. file1.txt has about 14K lines and file2.txt has about 2 billion. file1.txt has a single field f1 per line, while file2.txt has 3 fields, f1 through f3, delimited by |.

I want to find all lines of file2.txt where f1 of file1.txt matches f2 of file2.txt (or anywhere on the line, if we don't want to spend extra time splitting the values of file2.txt).

file1.txt (about 14K lines, not sorted):

foo1
foo2
...
bar1
bar2
...

file2.txt (about 2 billion lines, not sorted):

date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...

Expected output:

date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...

Here is what I have tried; it seems to take several hours to run:

fgrep -F -f file1.txt file2.txt > file.matched

I wonder whether there is a better and faster way of doing this with common Unix commands or with a small script.

Answer 1

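A Perl solution. [See the Note below.]

Use a hash for the first file. As you read the big file line by line, extract the field either with a regex (which captures the first pattern between ||) or with split (which gets the second word), and print the line if the field exists in the hash. The two likely differ a bit in speed (time them). The defined check isn't needed with the regex, while for split use // (defined-or), which short-circuits.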

use warnings;
use strict;

# If 'prog smallfile bigfile' is the preferred use
die "Usage: $0 smallfile bigfile\n"  if @ARGV != 2;
my ($smallfile, $bigfile) = @ARGV;

open my $fh, '<', $smallfile or die "Can't open $smallfile: $!";    
my %word = map { chomp; $_ => 1 } <$fh>;

open    $fh, '<', $bigfile or die "Can't open $bigfile: $!";       
while (<$fh>) 
{
    exists $word{ (/\|([^|]+)/)[0] } && print;  

    # Or
    #exists $word{ (split /\|/)[1] // '' } && print;
}
close $fh;

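Avoiding the if branch and using short-circuiting is faster, but only by very little. Over billions of lines these tweaks add up, but again not to much. It may (or may not) be a tad faster to read the small file line by line instead of in list context as above, but this should not be noticeable.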
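For readers more comfortable with Python, the same hash-lookup idea can be sketched as follows. This is only an illustration of the technique, not a port of the Perl above: the function name `matching_lines` and the in-memory sample data are made up for the example, and it always keys on the second field.

```python
import io

def matching_lines(keys_stream, data_stream, sep="|"):
    """Yield lines of data_stream whose second field is a key in keys_stream."""
    # Load the small file into a set for average O(1) membership tests.
    keys = {line.rstrip("\n") for line in keys_stream}
    for line in data_stream:
        # Split at most twice: only the second field is needed.
        fields = line.split(sep, 2)
        if len(fields) > 1 and fields[1].rstrip("\n") in keys:
            yield line

# Hypothetical sample data standing in for file1.txt and file2.txt.
small = io.StringIO("foo1\nbar1\n")
big = io.StringIO("date1|foo1|number1\ndate2|foo2|number2\ndate1|bar1|number1\n")
for hit in matching_lines(small, big):
    print(hit, end="")
```

With real files, the two streams would simply be the opened file handles; the big file is still read line by line, so only the small file is held in memory.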

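Update: Writing to STDOUT saves two operations, and I repeatedly timed it to be a little faster than writing to a file. Such usage is also consistent with most UNIX tools, so I changed the code to write to STDOUT. Next, the exists test is not needed, and dropping it spares an operation. However, I consistently get slightly better runtimes with it, and it also conveys the purpose better, so I am leaving it in. Thanks to ikegami for the comments.

Note: By my benchmark below, the commented-out version is about 50% faster than the other. Both are given because they differ: one finds the first match and the other the second field. I am keeping it this way as the more generic choice, since the question is ambiguous on this point.

Some comparisons (benchmark) [updated for writing to STDOUT, see "Update" above]

There is an extensive analysis in the answer by HåkonHægland, with run times for most of the solutions. Here is another take, benchmarking the two solutions above, the OP's own answer, and the posted fgrep one, which is expected to be fast and is used in the question and in many answers.

I build the test data in the following way. For both files, a handful of lines of roughly the length shown are made from random words, so that they match in the second field. Then I pad this "seed" with lines that won't match, to mimic the ratios between sizes and matches quoted by the OP: for 14K lines in the small file there are 1.3M lines in the big file, yielding 126K matches. These samples are then written repeatedly to build full data files like the OP's, shuffled each time using List::Util.

All runs compared below produce 106_120 matches for the above file sizes (checked with diff), so the matching frequency is close enough. They are benchmarked by calling the complete programs using my $res = timethese(60 ...). The result of cmpthese($res) on v5.16 is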

        Rate regex  cfor split fgrep
regex 1.05/s    --  -23%  -35%  -44%
cfor  1.36/s   30%    --  -16%  -28%
split 1.62/s   54%   19%    --  -14%
fgrep 1.89/s   80%   39%   17%    --
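To read the table: each percentage appears to be the row's rate relative to the column's rate. The small Python sketch below recomputes the percentages from the four rates listed; the formula is my reading of the table, not something taken from the Benchmark source.

```python
# Rates in runs per second, copied from the table.
rates = {"regex": 1.05, "cfor": 1.36, "split": 1.62, "fgrep": 1.89}

def relative_percent(row, col):
    """How much faster (+) or slower (-) `row` is than `col`, in percent."""
    return round((rates[row] / rates[col] - 1) * 100)

for row in rates:
    cells = ["--" if row == col else f"{relative_percent(row, col)}%" for col in rates]
    print(f"{row:<5} {rates[row]:.2f}/s " + " ".join(f"{c:>5}" for c in cells))
```

For example, split against regex gives 1.62 / 1.05 - 1, which rounds to 54%, matching the table.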

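The fact that the optimized C program fgrep comes out on top isn't surprising. The lag of "regex" behind "split" may be due to the overhead of starting the regex engine many times for small matches. This may vary across Perl versions, given the evolving regex engine optimizations. I include the answer of @codeforester ("cfor") since it was claimed to be the fastest, and its 20% lag behind the very similar "split" is likely due to scattered small inefficiencies (see a comment below this answer).†

The differences aren't dramatic, though there are certainly variations across hardware and software and over data details. I ran this on different Perls and machines, and the notable difference is that in some cases fgrep was indeed an order of magnitude faster.

The OP's experience of a very slow fgrep is surprising. Given their quoted run times, which are orders of magnitude slower than the above, I'd guess that an old system is to "blame."

Even though this is completely I/O based, there are concurrency benefits from putting it on multiple cores, and I'd expect a good speedup, up to a factor of a few.

† Alas, the comment got deleted (?). In short: needless use of a scalar (which costs), of an if branch, of defined, and of printf instead of print (slow!). These matter for efficiency on 2 billion lines.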

Answer 2

Did you try Awk? It could speed things up a bit:

awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt

(or) using the index() function in Awk, as suggested in the comments by Benjamin W.,

awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (index($0,i)) {print; break}}' file1.txt FS='|' file2.txt

(or) a more direct regex match, as suggested by Ed Morton in the comments,

awk 'FNR==NR{hash[$1]; next}{for (i in hash) if ($0~i) {print; break}}' file1.txt FS='|' file2.txt

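is all you need. I'm guessing this will be faster, but I'm not exactly sure for files with a million+ entries. The problem here is that the match may occur anywhere along the line. If the match were confined to a particular column (say, $2 alone), a faster approach would be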

awk 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt

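Also, you can speed things up by playing with the locale set on your system. Paraphrasing Stéphane Chazelas's wonderful answer on the subject, you can get a quick speedup by passing the locale LC_ALL=C to the command being run.

On any GNU-based system, the defaults for the locale are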

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

With the single variable LC_ALL, you can set all LC_ type variables at once to a specified locale:

$ LC_ALL=C locale
LANG=en_US.UTF-8
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"       
LC_ALL=C

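So what does this affect?

Simply put, when using the C locale, the command defaults to the base Unix/Linux language of ASCII. When you grep something, your locale is by default internationalized and set to UTF-8, which can represent every character in the Unicode character set to help display any of the world's writing systems, currently more than 110,000 unique characters. With ASCII, each character is encoded in a single byte, and the character set comprises no more than 128 unique characters.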

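So it translates to this: when using grep on a file encoded in the UTF-8 character set, it needs to match each character against any of over a hundred thousand unique characters, but only 128 in ASCII. So run your fgrep as

LC_ALL=C fgrep -F -f file1.txt file2.txt

The same can also be applied to the Awk solutions: since the match($0,i) call performs a regex match, setting the C locale could speed up the string matching.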
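The difference between the two encodings can be seen in a few lines of Python. This is only a loose analogy to what LC_ALL=C does for grep and awk: the sample line with a non-ASCII character and the helper `has_key_bytes` are made up for the illustration.

```python
ascii_text = "date1|foo1|number1"
utf8_text = "dåte1|foo1|number1"   # hypothetical line with one non-ASCII character

# ASCII characters take one byte each in UTF-8; other characters take two to four.
print(len(ascii_text), len(ascii_text.encode("utf-8")))   # 18 18
print(len(utf8_text), len(utf8_text.encode("utf-8")))     # 18 19

def has_key_bytes(line: bytes, key: bytes) -> bool:
    """Byte-wise test: is `key` a field between two pipes? No decoding involved."""
    return b"|" + key + b"|" in line

print(has_key_bytes(b"date1|foo1|number1", b"foo1"))
```

Working on raw bytes means no character decoding has to happen per line, which is roughly the saving the C locale buys when the data is plain ASCII anyway.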

Answer 3

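Assumptions:
1. You want to run this search on your local workstation only.
2. You have multiple cores/CPUs to take advantage of a parallel search.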

parallel --pipepart -a file2.txt --block 10M fgrep -F -f file1.txt

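Some further tweaks, depending on the context:
A. Disable NLS with LANG=C (already mentioned in another answer).
B. Set a maximum number of matches with the -m flag.

Note: I'm guessing that file2 is ~4GB and that the 10M block size is OK, but you may need to tune the block size to get the fastest run.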
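To make the block idea concrete, here is a rough Python sketch of cutting a file into byte ranges that end on line boundaries and searching each range independently. It is written in the spirit of the command above and is not a description of how GNU parallel is implemented; `block_ranges` and `grep_block` are names invented for the example, and the search keys on the second field rather than the whole line.

```python
import os

def block_ranges(path, block_size):
    """Split a file into (start, end) byte ranges that end on line boundaries."""
    size = os.path.getsize(path)
    ranges, start = [], 0
    with open(path, "rb") as f:
        while start < size:
            f.seek(min(start + block_size, size))
            f.readline()              # extend the block to the end of the current line
            end = f.tell()
            ranges.append((start, end))
            start = end
    return ranges

def grep_block(path, start, end, keys):
    """Return the lines in [start, end) whose second '|'-field is in keys."""
    with open(path, "rb") as f:
        f.seek(start)
        lines = f.read(end - start).splitlines(keepends=True)
    hits = []
    for line in lines:
        fields = line.split(b"|", 2)
        if len(fields) > 1 and fields[1].rstrip(b"\n") in keys:
            hits.append(line)
    return hits
```

Because the ranges do not overlap and never split a line, each `(start, end)` pair could be handed to a separate worker process, for instance through `concurrent.futures.ProcessPoolExecutor`, with the results concatenated afterwards.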

Answer 4

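A small piece of Perl code solved the problem. This is the approach taken:

store the lines of file1.txt in a hash
read file2.txt line by line, parse it and extract the second field
check whether the extracted field is in the hash; if so, print the line

Here is the code: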

#!/usr/bin/perl -w

use strict;
if (scalar(@ARGV) != 2) {
  printf STDERR "Usage: fgrep.pl smallfile bigfile\n";
  exit(2);
}

my ($small_file, $big_file) = ($ARGV[0], $ARGV[1]);
my ($small_fp, $big_fp, %small_hash, $field);

open($small_fp, "<", $small_file) || die "Can't open $small_file: " . $!;
open($big_fp, "<", $big_file)     || die "Can't open $big_file: "   . $!;

# store contents of small file in a hash
while (<$small_fp>) {
  chomp;
  $small_hash{$_} = undef;
}
close($small_fp);

# loop through big file and find matches
while (<$big_fp>) {
  # no need for chomp
  $field = (split(/\|/, $_))[1];
  if (defined($field) && exists($small_hash{$field})) {
    printf("%s", $_);
  }
}

close($big_fp);
exit(0);

I ran the above script with 14K lines in file1.txt and 1.3M lines in file2.txt. It finished in about 13 seconds, producing 126K matches. Here is the time output for that run:

real    0m11.694s
user    0m11.507s
sys 0m0.174s

I ran @Inian's awk code:

awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt

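It was far slower than the Perl solution, since it loops 14K times for each line in file2.txt, which is really expensive. It aborted after processing 592K records of file2.txt and producing 40K matched lines. Here is how long it took: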

awk: illegal primary in regular expression 24/Nov/2016||592989 at 592989
 input record number 675280, file file2.txt
 source line number 1

real    55m5.539s
user    54m53.080s
sys 0m5.095s
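The abort above comes from a file1.txt line being interpreted as a regular expression. Regex engines differ in how they react to a pattern such as the one in the error message: the awk used here rejected it, while Python's re module, used below purely as an illustration, accepts it and silently matches every line. The sample line is made up.

```python
import re

pattern = "24/Nov/2016||592989"   # the pattern quoted in the awk error above
line = "date1|foo1|number1"       # a line that does not contain the pattern text

# As a regex, '||' is an alternation with an empty branch, and in Python's
# re module the empty branch matches any line.
print(bool(re.search(pattern, line)))               # True
# Escaping turns the pattern back into a literal string.
print(bool(re.search(re.escape(pattern), line)))    # False
# A plain substring test needs no regex engine at all.
print(pattern in line)                              # False
```

Either way, the patterns in this question are fixed strings, so a fixed-string test (index() in awk, a hash lookup on the field, or `in` here) avoids the problem altogether.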

Using @Inian's other awk solution, which eliminates the looping issue:

time awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk1.out

real    0m39.966s
user    0m37.916s
sys 0m0.743s

time LC_ALL=C awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk.out

real    0m41.057s
user    0m38.475s
sys 0m0.904s

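awk is very impressive here, given that we didn't have to write an entire program to do it.

I ran @oliv's Python code as well. It took about 15 hours to complete the job, and it looked like it produced the right results. Building a huge regex isn't as efficient as using a hash lookup. Here is the time output: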

real    895m14.862s
user    806m59.219s
sys 1m12.147s

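I tried to follow the suggestion to use parallel. However, it failed with an fgrep: memory exhausted error, even with very small block sizes.

What surprised me was that fgrep was totally unsuitable for this. I aborted it after 22 hours, by which point it had produced about 100K matches. I wish fgrep had an option to force the contents of -f file to be kept in a hash, as the Perl code does.

I didn't check the join approach, because I didn't want the additional overhead of sorting the files. Also, given fgrep's poor performance, I don't believe join would have done better than the Perl code.

Thanks, everyone, for your attention and responses.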

Answer 5

This Perl script (a) generates a regex pattern:

#!/usr/bin/perl

use strict;
use warnings;

use Regexp::Assemble qw( );

chomp( my @ids = <> );
my $ra = Regexp::Assemble->new();
$ra->add(quotemeta($_)) for @ids;
print("^[^|]*\\|(?:" . (re::regexp_pattern($ra->re()))[0] . ")\\|");

Here's how it can be used:

$ LC_ALL=C grep -P "$( a file1.txt )" file2.txt
date1|foo1|number1
date2|foo2|number2
date1|bar1|number1
date2|bar2|number2

Note that the script uses Regexp::Assemble, so you may need to install it.

sudo su
cpan Regexp::Assemble

Notes:

Unlike the solutions dubbed BOC1, BOC2, codeforester_orig, gregory1, inian2, inian4 and oliv, my solution correctly handles
```
file1.txt
foo1

file2.txt
date1|foo12|number5
```
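Mine should be better than the similar solution by @BOC, because the pattern is optimized to reduce backtracking. (Mine also works if there are more than three fields in file2.txt, whereas the linked solution can fail.)
I don't know how it compares to the split + dictionary solutions.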
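The shape of the generated pattern can be illustrated in Python. This sketch only joins the escaped keys into a plain alternation; it does not perform the prefix merging that Regexp::Assemble does, so it shows the anchoring, not the optimization. The function name `build_field2_pattern` is made up.

```python
import re

def build_field2_pattern(keys):
    """Compile a pattern that matches only when the second field equals a key."""
    # Escape every key so it is matched literally, then join into one alternation.
    alternation = "|".join(re.escape(k) for k in keys)
    # Anchor at line start, skip the first field, and require a pipe on both sides.
    return re.compile(r"^[^|]*\|(?:" + alternation + r")\|")

pat = build_field2_pattern(["foo1", "bar1"])
print(bool(pat.search("date1|foo1|number1")))    # True
print(bool(pat.search("date1|foo12|number5")))   # False: foo12 is not foo1
```

The pipe required after the key is what rejects foo12, and the `^[^|]*\|` prefix is what restricts the match to the second field.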

Answer 6

Here is a Perl solution that uses Inline::C to speed up the search for matching fields in the large file:

use strict;
use warnings;
use Inline C => './search.c';

my $smallfile = 'file1.txt';
my $bigfile   = 'file2.txt';

open my $fh, '<', $smallfile or die "Can't open $smallfile: $!";
my %word = map { chomp; $_ => 1 } <$fh>;
search( $bigfile, \%word );

The search() subroutine is implemented in pure C, using perlapi to look up keys in the small-file dictionary %word:

search.c:

#include <stdio.h>
#include <sys/stat.h> 
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>


#define BLOCK_SIZE 8192       /* how much to read from file each time */
static char read_buf[BLOCK_SIZE + 1];

/*  reads a block from file, returns -1 on error, 0 on EOF, 
     else returns chars read, pointer to buf, and pointer to end of buf  */
ssize_t read_block( int fd, char **ret_buf, char **end_buf ) {
    ssize_t ret;   /* signed, so the -1 error value from read() survives */
    char *buf = read_buf;
    size_t len = BLOCK_SIZE;
    while (len != 0 && (ret = read(fd, buf, len)) != 0) {
        if (ret == -1) {
            if (errno == EINTR)
                continue;
            perror( "read" );
            return ret;
        }
        len -= ret;
        buf += ret;
    }
    *end_buf = buf;
    *ret_buf = read_buf;
    return (ssize_t) (*end_buf - *ret_buf);
}

/* updates the line buffer with the char pointed to by cur,
   also updates cur
    */
int update_line_buffer( char **cur, char **line, size_t *llen, size_t max_line_len ) {
    if ( *llen > max_line_len ) {
        fprintf( stderr, "Too long line. Maximum allowed line length is %zu\n",
                 max_line_len );
        return 0;
    }
    **line = **cur;
    (*line)++;
    (*llen)++;
    (*cur)++; 
    return 1;
}


/*    search for first pipe on a line (or next line if this is empty),
    assume line ptr points to beginning of line buffer.
  return 1 on success
  Return 0 if pipe could not be found for some reason, or if 
    line buffer length was exceeded  */
int search_field_start(
    int fd, char **cur, char **end_buf, char **line, size_t *llen, size_t max_line_len
) {
    char *line_start = *line;

    while (1) {
        if ( *cur >= *end_buf ) {
            ssize_t res = read_block( fd, cur, end_buf );  /* signed: -1 means read error */
            if (res <= 0) return 0;
        }
        if ( **cur == '|' ) break;
        /* Currently we just ignore malformed lines ( lines that do not have a pipe,
           and empty lines in the input */
        if ( **cur == '\n' ) {
            *line = line_start;
            *llen = 0;
            (*cur)++;
        }
        else {
            if (! update_line_buffer( cur, line, llen, max_line_len ) ) return 0;
        }
    }
    return 1;
}

/* assume cur points at starting pipe of field
  return -1 on read error, 
  return 0 if field len was too large for buffer or line buffer length exceed,
  else return 1
  and field, and  length of field
 */
int copy_field(
    int fd, char **cur, char **end_buf, char *field,
    size_t *flen, char **line, size_t *llen, size_t max_field_len, size_t max_line_len
) {
    *flen = 0;
    while( 1 ) {
        if (! update_line_buffer( cur, line, llen, max_line_len ) ) return 0;
        if ( *cur >= *end_buf ) {
            ssize_t res = read_block( fd, cur, end_buf );  /* signed: -1 means read error */
            if (res <= 0) return -1;
        }
        if ( **cur == '|' ) break;
        if ( *flen > max_field_len ) {
            fprintf( stderr, "Field width too large. Maximum allowed field width: %zu\n",
                     max_field_len );
            return 0;
        }
        *field++ = **cur;
        (*flen)++;
    }
    /* It is really not necessary to null-terminate the field 
       since we return length of field and also field could 
       contain internal null characters as well
    */
    //*field = '\0';
    return 1;
}

/* search to beginning of next line,
  return 0 on error,
  else return 1 */
int search_eol(
    int fd, char **cur, char **end_buf, char **line, size_t *llen, size_t max_line_len)
{
    while (1) {
        if ( *cur >= *end_buf ) {
            ssize_t res = read_block( fd, cur, end_buf );  /* signed: -1 means read error */
            if (res <= 0) return 0;
        }
        if ( !update_line_buffer( cur, line, llen, max_line_len ) ) return 0;
        if ( *(*cur-1) == '\n' ) {
            break;
        }
    }
    //**line = '\0'; // not necessary
    return 1;
}

#define MAX_FIELD_LEN 80  /* max number of characters allowed in a field  */
#define MAX_LINE_LEN 80   /* max number of characters allowed on a line */

/* 
   Get next field ( i.e. field #2 on a line). Fields are
   separated by pipes '|' in the input file.
   Also get the line of the field.
   Return 0 on error,
   on success: Move internal pointer to beginning of next line
     return 1 and the field.
 */
size_t get_field_and_line_fast(
    int fd, char *field, size_t *flen, char *line, size_t *llen
) {
    static char *cur = NULL;
    static char *end_buf = NULL;

    ssize_t res;  /* signed: read_block() and copy_field() may return -1 */
    if (cur == NULL) {
        res = read_block( fd, &cur, &end_buf );        
        if ( res <= 0 ) return 0;
    }
    *llen = 0;
    if ( !search_field_start( fd, &cur, &end_buf, &line, llen, MAX_LINE_LEN )) return 0;
    if ( (res = copy_field(
        fd, &cur, &end_buf, field, flen, &line, llen, MAX_FIELD_LEN, MAX_LINE_LEN
    ) ) <= 0)
        return 0;
    if ( !search_eol( fd, &cur, &end_buf, &line, llen, MAX_LINE_LEN ) ) return 0;
    return 1;
}

void search( char *filename, SV *href) 
{
    if( !SvROK( href ) || ( SvTYPE( SvRV( href ) ) != SVt_PVHV ) ) {
        croak( "Not a hash reference" );
    }

    int fd = open (filename, O_RDONLY);
    if (fd == -1) {
        croak( "Could not open file '%s'", filename );
    }
    char field[MAX_FIELD_LEN+1];
    char line[MAX_LINE_LEN+1];
    size_t flen, llen;
    HV *hash = (HV *)SvRV( href );
    while ( get_field_and_line_fast( fd, field, &flen, line, &llen ) ) {
        if( hv_exists( hash, field, flen ) )
            fwrite( line, sizeof(char), llen, stdout);
    }
    if (close(fd) == -1)
        croak( "Close failed" );

}

Tests indicate that it is approximately 3 times faster than the fastest pure Perl solution presented here (see method zdim2 in my other answer).
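The block-reading logic of search.c can be outlined in Python for readers who don't want to trace the C. This is an outline of the idea only, with none of the speed: it reads fixed-size blocks, carries an unfinished line over to the next block, and looks up the bytes between the first two pipes. Skipping lines with fewer than two pipes and printing a final line that lacks a newline are my simplifications and may differ from the C code.

```python
import io

BLOCK_SIZE = 8192   # same block size as search.c

def second_field(line):
    """Return the bytes between the first two pipes, or None if there are fewer than two."""
    first = line.find(b"|")
    second = line.find(b"|", first + 1) if first != -1 else -1
    return line[first + 1:second] if second != -1 else None

def search(stream, keys, block_size=BLOCK_SIZE):
    """Yield lines from a binary stream whose second '|'-field is in keys."""
    tail = b""                          # unfinished line carried over from the previous block
    while True:
        block = stream.read(block_size)
        if not block:
            break
        lines = (tail + block).split(b"\n")
        tail = lines.pop()              # the last piece has no newline yet
        for line in lines:
            if second_field(line) in keys:
                yield line + b"\n"
    if tail and second_field(tail) in keys:   # final line without a trailing newline
        yield tail

data = io.BytesIO(b"date1|foo1|number1\ndate2|foo2|number2\ndate1|bar1|number1\n")
for hit in search(data, {b"foo1", b"bar1"}):
    print(hit.decode(), end="")
```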

Answer 7

Can you give join a try? The files must be sorted, though...

$ cat d.txt
bar1
bar2
foo1
foo2

$ cat e.txt
date1|bar1|number1
date2|bar2|number2
date3|bar3|number3
date1|foo1|number1
date2|foo2|number2
date3|foo3|number3

$ join --nocheck-order -11 -22 -t'|' -o 2.1 2.2 2.3 d.txt e.txt
date1|bar1|number1
date2|bar2|number2
date1|foo1|number1
date2|foo2|number2

Small update:
Using LC_ALL=C in front of join really speeds things up, as can be seen in Håkon Hægland's benchmark.

PS1: I have my doubts that join can be faster than grep -f ...
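The reason join needs sorted input is that it walks both files once, in step. A minimal Python sketch of that sort-merge idea, assuming the keys are sorted and the lines are sorted on their second field (as in d.txt and e.txt above); `merge_join` is a name made up for the example, and every line is assumed to contain at least one pipe.

```python
def merge_join(sorted_keys, sorted_lines, sep="|"):
    """Yield lines whose second field equals a key; both inputs sorted on that value."""
    keys = iter(sorted_keys)
    key = next(keys, None)
    for line in sorted_lines:
        field = line.split(sep, 2)[1]
        # Advance the key cursor until it catches up with the current field.
        while key is not None and key < field:
            key = next(keys, None)
        if key is None:
            break                     # no keys left, so nothing further can match
        if key == field:
            yield line

d = ["bar1", "bar2", "foo1", "foo2"]
e = ["date1|bar1|number1", "date2|bar2|number2", "date3|bar3|number3",
     "date1|foo1|number1", "date2|foo2|number2", "date3|foo3|number3"]
print(list(merge_join(d, e)))
```

Python compares strings by code point, which corresponds to the byte order sort and join use under LC_ALL=C, so the two sort orders have to agree for this to work.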

Answer 8

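Though this thread is over, all the grep-like methods for matching between two files are gathered in this post, so why not add this awk alternative, which is similar to (or even an improvement on) Inian's bounty-winning awk solution: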

awk 'NR==FNR{a[$0]=1;next}a[$2]' patterns.txt FS="|" datafile.txt >matches.txt # For matches restricted on Field2 of datafile

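This is equivalent to Inian's awk $2 in hash solution, but it may be even faster because we don't ask awk to check whether the hash array contains $2 of file2; we only check whether a[$2] has a value. (One caveat: in awk, merely referencing a[$2] creates an empty entry for every key that was not in the patterns file, so memory use grows with the number of distinct $2 values.)

While reading the patterns file, besides creating the hash array, we also assign a value to each entry.

If $2 of the data file was found earlier in the patterns file, then a[$2] has a value and the line is printed, because the value is not null.

If a[$2] returns no value (null) for a data-file line, this is treated as false => nothing is printed.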
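The difference between testing a value and testing membership can be mimicked in Python with a defaultdict, which, like an awk array, creates an entry when a missing key is referenced. This is an analogy only; the field values below are made up.

```python
from collections import defaultdict

patterns = ["foo1", "bar1"]
fields = ["foo1", "foo2", "bar1", "bar3"]   # second fields of four hypothetical data lines

# Lookup by value, loosely like awk's a[$2]: missing keys are created as a side effect.
a = defaultdict(int, {p: 1 for p in patterns})
hits_by_value = [f for f in fields if a[f]]
print(hits_by_value, len(a))        # ['foo1', 'bar1'] 4

# Membership test, like awk's ($2 in hash): the table is never modified.
h = set(patterns)
hits_by_membership = [f for f in fields if f in h]
print(hits_by_membership, len(h))   # ['foo1', 'bar1'] 2
```

Both tests select the same lines; the first one leaves the table with four entries instead of two, which is the growth to keep in mind on a data file with billions of lines.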

An extension to match any of the three fields of the data file:

awk 'NR==FNR{a[$0]=1;next}(a[$1] || a[$2] || a[$3])' patterns.txt FS="|" datafile.txt >matches.txt # Printed if any of the three fields of datafile matches a pattern.

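In both cases, applying LC_ALL=C in front of awk seems to speed things up.

PS1: Of course, this solution also has the pitfalls of all the awk solutions. It is not pattern matching; it is a direct/fixed match between the two files, like most of the solutions here.

PS2: In my benchmark on a poor machine, using Håkon Hægland's small benchmark files, I get about 20% better performance compared to awk 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt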

Answer 9

A possible way is to use python:

$ cat test.py
import sys, re

with open(sys.argv[1], "r") as f1:
    patterns = f1.read().splitlines() # read patterns from file1 without the trailing newline

# create one regex; escape each pattern so it is matched as a fixed string
m = re.compile("|".join(map(re.escape, patterns)))

with open(sys.argv[2], "r") as f2:
    for line in f2:
        if m.search(line):
            print(line, end="")       # print line from file2 if it matches the regex (Python 3)

and use it like this:

python test.py file1.txt file2.txt

Answer 10

You can also use Perl for this:

Please note that this will hog memory, so your machine/server had better have plenty of it.

Sample data:

%_STATION@gaurav * /root/ga/pl> head file1.txt file2.txt
==> file1.txt <==
foo1
foo2
...
bar1
bar2
...

==> file2.txt <==
date1|foo1|number1
date2|foo2|number2
date3|foo3|number3
...
date1|bar1|number1
date2|bar2|number2
date3|bar3|number3
%_STATION@gaurav * /root/ga/study/pl>

Script output: the script writes its final output to a file named output_comp.

%_STATION@gaurav * /root/ga/pl> ./comp.pl  file1.txt file2.txt ; cat output_comp
date1|bar1|number1
date2|bar2|number2
date2|foo2|number2
date1|foo1|number1
%_STATION@gaurav * /root/ga/pl>

Script:

%_STATION@gaurav * /root/ga/pl> cat comp.pl
#!/usr/bin/perl

use strict ;
use warnings ;
use Data::Dumper ;

my ($file1,$file2) = @ARGV ;
my $output = "output_comp" ;
my %hash ;    # This will store main comparison data.
my %tmp ;     # This will store already selected results, to be skipped.
(scalar @ARGV != 2 ? (print "Need 2 files!\n") : ()) ? exit 1 : () ;

# Read all files at once and use their name as the key.
for (@ARGV) {
  open FH, "<$_" or die "Cannot open $_\n" ;
  while  (my $line = <FH>) {chomp $line ;$hash{$_}{$line} = "$line"}
  close FH ;
}

# Now we churn through the data and compare to generate
# the sorted output in the output file.
open FH, ">>$output" or die "Cannot open outfile!\n" ;
foreach my $k1 (keys %{$hash{$file1}}){
  foreach my $k2 (keys %{$hash{$file2}}){
    if (index($k2, "|$k1|") >= 0) {   # fixed-string test: true when key $k1 is a whole field between two pipes in line $k2
      if (!defined $tmp{"$hash{$file2}{$k2}"}) {
        print FH "$hash{$file2}{$k2}\n" ;
        $tmp{"$hash{$file2}{$k2}"} = 1 ;
      }
    }
  }
}
close FH  ;
%_STATION@gaurav * /root/ga/pl>

Thanks.

Answer 11

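IMHO, grep is a good tool that is highly optimised for a huge file2.txt, but maybe not for so many patterns to search for. I suggest combining all the strings of file1.txt into a single huge regexp like \|bar1|bar2|foo1|foo2\|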

echo  '\|'$(paste -s -d '|' file1.txt)'\|' > regexp1.txt

grep -E -f regexp1.txt file2.txt > file.matched

And of course LANG=C may help. Please give feedback or send your files so that I can test it myself.
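One thing to watch with a single big alternation: in regular expressions, | binds more loosely than concatenation, so the leading and trailing \| only attach to the first and last alternatives unless the list is grouped. The Python sketch below shows the effect; Python's re is used only for illustration, and in grep -E the group would be written with plain parentheses rather than (?:...).

```python
import re

keys = ["bar1", "bar2", "foo1", "foo2"]
ungrouped = re.compile(r"\|" + "|".join(keys) + r"\|")      # \|bar1|bar2|foo1|foo2\|
grouped = re.compile(r"\|(?:" + "|".join(keys) + r")\|")    # \|(?:bar1|bar2|foo1|foo2)\|

line = "date1|foo12|number5"    # the second field is foo12, not foo1
print(bool(ungrouped.search(line)))   # True: the bare alternative foo1 matches inside foo12
print(bool(grouped.search(line)))     # False
```

Keys containing regex metacharacters would additionally need escaping, which the simple keys here do not.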

Answer 12

Using flex:

1: Build the flex processor:

$ awk 'NR==1{ printf "%%%%\n\n.*\\|(%s",$0 } 
            { printf "|%s",$0 } 
       END  { print ")\\|.*\\n ECHO;\n.*\\n ;\n%%\n" }' file1.txt > a.fl

2: Compile it:

$ flex -Ca -F a.fl ; cc -O lex.yy.c -lfl

3: Then run it:

$ a.out < file2.txt  > out

Compiling (cc ...) is a slow process; this approach pays off only when file1.txt is stable.

(On my machine) a "100 in 10_000_000" search test ran 3 times faster with this approach than with LC_ALL=C fgrep...

Answer 13

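Setting the language etc. helps a little, perhaps.

Otherwise I cannot think of a magic solution to escape your basic issue: the data is not structured, so you end up with a search that comes down to the number of lines in file1 multiplied by the number of lines in file2.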
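For scale, here is that product worked out with the sizes from the question (rounded as the OP gave them), next to the count for the approach other answers take of holding file1 in a hash so that each file2 line costs one lookup. A trivial Python calculation:

```python
small_lines = 14_000             # "about 14K" lines in file1.txt
big_lines = 2_000_000_000        # "about 2 billion" lines in file2.txt

naive_tests = small_lines * big_lines   # every pattern tried against every line
hashed_lookups = big_lines              # one hash lookup per line of file2.txt

print(f"{naive_tests:,} pattern tests vs {hashed_lookups:,} lookups")
# 28,000,000,000,000 pattern tests vs 2,000,000,000 lookups
```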

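Putting the billion lines in a database and indexing them in a smart way is the only speedup I can think of. That index would have to be very smart, though...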
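As a toy illustration of the database route, here is a sketch using Python's built-in sqlite3 module. The table and index names (big, small, big_f2) are made up, the data is three sample lines, and nothing here says how such a setup would perform at 2 billion rows; loading and indexing that much data has its own cost.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # a real run would use an on-disk database file
conn.execute("CREATE TABLE big (f1 TEXT, f2 TEXT, f3 TEXT)")
conn.execute("CREATE TABLE small (f1 TEXT PRIMARY KEY)")

rows = [line.split("|") for line in
        ["date1|foo1|number1", "date2|foo2|number2", "date1|bar1|number1"]]
conn.executemany("INSERT INTO big VALUES (?, ?, ?)", rows)
conn.executemany("INSERT INTO small VALUES (?)", [("foo1",), ("bar1",)])

# Index the join column so the match can use B-tree lookups instead of full scans.
conn.execute("CREATE INDEX big_f2 ON big (f2)")

query = """SELECT big.f1 || '|' || big.f2 || '|' || big.f3
           FROM big JOIN small ON big.f2 = small.f1
           ORDER BY big.rowid"""
matched = [row[0] for row in conn.execute(query)]
print(matched)   # ['date1|foo1|number1', 'date1|bar1|number1']
```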

The simple solution is: have enough memory to fit everything in. Otherwise there is not much more you can do about this...
