Question

关于代码性能的问题：我正在尝试针对~20g文本文件运行~25个正则表达式规则。脚本应该输出匹配到文本文件;每个正则表达式规则生成自己的文件。请参阅下面的伪代码：

regex_rules=~/Documents/rulesfiles/regexrulefile.txt
for tmp in *.unique20gbfile.suffix; do
    while read line
    # Each $line in the looped-through file contains a regex rule, e.g.,
    # egrep -i '(^| )justin ?bieber|(^| )selena ?gomez'
    # $rname is a unique rule name generated by a separate bash function
    # exported to the current shell.
        do
        cmd="$line $tmp > ~/outputdir/$tmp.$rname.filter.piped &"
        eval $cmd
    done < $regex_rules
done

几个想法：

有没有办法循环文本文件一次，评估所有规则并一次性拆分到单个文件？这会更快吗？
我应该使用不同的工具来完成这项工作吗？

感谢。

Answer 1

这是grep选项-f的原因。将regexrulefile.txt简化为正则表达式，每行一个，然后运行

egrep -f regexrulefile.txt the_big_file

这会在单个输出流中生成所有匹配项，但您可以在其后执行循环操作以将它们分开。假设组合的比赛列表不是很大，这将是一场表现胜利。

Answer 2

快速（！=太快）Perl解决方案：

#!/usr/bin/perl
use strict; use warnings;

我们预加载正则表达式，以便我们只读取一次文件。它们存储在数组@regex中。正则表达式文件是作为参数给出的第一个文件。

open REGEXES, '<', shift(@ARGV) or die;
my @regex = map {qr/$_/} <REGEXES>;
# use the following if the file still includes the egrep:
# my @regex = map {
#     s/^egrep \s+ -i \s+ '? (.*?) '? \s* $/$1/x;
#     qr{$_}
# } <REGEXES>;
close REGEXES or die;

我们浏览作为参数提供的每个剩余文件：

while (@ARGV) {
  my $filename = shift @ARGV;

我们预先打开文件以提高效率：

  my @outfile = map {
     open my $fh, '>', "outdir/$filename.$_.filter.piped"
       or die "Couldn't open outfile for $filename, rule #$_";
     $fh;
  } (1 .. scalar(@rule));
  open BIGFILE, '<', $filename or die;

我们将符合规则的所有行打印到指定文件。

  while (not eof BIGFILE) {
    my $line = <BIGFILE>;
    for $ruleNo (0..$#regex) {
      print $outfile[$ruleNo] $line if $line =~ $regex[$ruleNo];
      # if only the first match is interesting:
      # if ($line =~ $regex[$ruleNo]) {
      #     print $outfile[$ruleNo] $line;
      #     last;
      # }
    }
  }

在下一次迭代之前清理：

  foreach (@outfile) {
    close $_ or die;
  }
  close BIGFILE or die;
}

print "Done";

调用：$ perl ultragrepper.pl regexFile bigFile1 bigFile2 bigFile3等。任何更快的内容都必须直接用C语言编写。您的硬盘数据传输速度是限制。

这应该作为bash pendant运行得更快，因为我避免重新打开文件或重新分析正则表达式。此外，不必为外部工具生成新的流程。但是我们可以产生几个线程！（至少NumOfProcessors * 2线程可能是明智的）

local $SIG{CHLD} = undef;
while (@ARGV) {
    next if fork();
    ...;
    last;
}

Answer 3

我做了与lex类似的事情。当然，它每隔一天运行一次，所以YMMV。即使在远程Windows共享上的几百兆字节文件上，它也非常快。处理只需几秒钟。我不知道你是一个快速C程序是多么的舒服，但我发现这是解决大规模正则表达式问题的最快，最简单的解决方案。

为了保护有罪而编辑的部分：

    /************************************************** 
        start of definitions section

    ***************************************************/


%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <getopt.h>
#include <errno.h>

char inputName[256];
// static insert variables

//other variables
char tempString[256];
char myHolder[256];
char fileName[256];
char unknownFileName[256];
char stuffFileName[256];
char buffer[5];

/* we are using pointers to hold the file locations, and allow us to dynamically open and close new files */
/* also, it allows us to obfuscate which file we are writing to, otherwise this couldn't be done */

FILE *yyTemp;
FILE *yyUnknown;
FILE *yyStuff;

// flags for command line options
static int help_flag = 0;

%}

%option 8bit 
%option nounput nomain noyywrap 
%option warn

%%
    /************************************************ 
        start of rules section
    *************************************************/


(\"A\",\"(1330|1005|1410|1170)\") { 
    strcat(myHolder, yytext);
    yyTemp = &(*yyStuff);
} //stuff files

. { strcat(myHolder, yytext); }

\n  {
    if (&(*yyTemp) == &(*yyUnknown))
        unknownCount += 1;
    strcat(myHolder, yytext); 
    //print to file we are pointing at, whatever it is
    fprintf(yyTemp, "%s", myHolder);
    strcpy(myHolder, "");
    yyTemp = &(*yyUnknown);
}

<<EOF>> {
    strcat(myHolder, yytext); 
    fprintf(yyTemp, "%s", myHolder);
    strcpy(myHolder, "");
    yyTemp = &(*yyUnknown);

    yyterminate();
}

%%
    /**************************************************** 
        start of code section


    *****************************************************/


int main(int argc, char **argv);

int main (argc,argv)
int argc;
char **argv;
{
    /****************************************************
        The main method drives the program. It gets the filename from the
        command line, and opens the initial files to write to. Then it calls the lexer.
        After the lexer returns, the main method finishes out the report file,
        closes all of the open files, and prints out to the command line to let the
        user know it is finished.
    ****************************************************/

    int c;

    // the gnu getopt library is used to parse the command line for flags
    // afterwards, the final option is assumed to be the input file

    while (1) {
        static struct option long_options[] = {
            /* These options set a flag. */
            {"help",   no_argument,     &help_flag, 1},
            /* These options don't set a flag. We distinguish them by their indices. */
            {0, 0, 0, 0}
        };
           /* getopt_long stores the option index here. */
        int option_index = 0;
        c = getopt_long (argc, argv, "h",
            long_options, &option_index);

        /* Detect the end of the options. */
        if (c == -1)
            break;

        switch (c) {
            case 0:
               /* If this option set a flag, do nothing else now. */
               if (long_options[option_index].flag != 0)
                 break;
               printf ("option %s", long_options[option_index].name);
               if (optarg)
                 printf (" with arg %s", optarg);
               printf ("\n");
               break;

            case 'h':
                help_flag = 1;
                break;

            case '?':
               /* getopt_long already printed an error message. */
               break;

            default:
               abort ();
            }
    }

    if (help_flag == 1) {
        printf("proper syntax is: yourProgram.exe [OPTIONS]... INFILE\n");
        printf("splits csv file into multiple files")
        printf("Option list: \n");
        printf("--help                  print help to screen\n");
        printf("\n");
        return 0;
    }

    //get the filename off the command line and redirect it to input
    //if there is no filename then use stdin

    if (optind < argc) {
        FILE *file;

        file = fopen(argv[optind], "r");
        if (!file) {
            fprintf (stderr, "%s: Couldn't open file %s; %s\n", argv[0], argv[optind], strerror (errno));
            exit(errno);
        }
        yyin = file;
        strcpy(inputName, argv[optind]);
    }
    else {
        printf("no input file set, using stdin. Press ctrl-c to quit");
        yyin = stdin;
        strcpy(inputName, "\b\b\b\b\bagainst stdin");
    }

    //set up initial file names

    strcpy(fileName, inputName);
    strncpy(unknownFileName, fileName, strlen(fileName)-4);
    strncpy(stuffFileName, fileName, strlen(fileName)-4);

    strcat(unknownFileName, "_UNKNOWN_1.csv");
    strcat(stuffFileName, "_STUFF_1.csv");

    //open files for writing

    yyout = stdout;
    yyTemp = malloc(sizeof(FILE));
    yyUnknown = fopen(unknownFileName,"w");
    yyTemp = &(*yyUnknown);

    yyStuff = fopen(stuffFileName,"w");

    yylex();

    //close open files

    fclose(yyUnknown);

    printf("Lexer finished running %s",fileName);

    return 0;

}

要构建此flex程序，请安装flex，并使用此makefile（调整路径）：

TARGET = project.exe
TESTBUILD = project
LEX = flex
LFLAGS = -Cf
CC = i586-mingw32msvc-gcc
CFLAGS = -O -Wall 
INSTALLDIR = /mnt/J/Systems/executables

.PHONY: default all clean install uninstall cleanall

default: $(TARGET)

all: default install

OBJECTS = $(patsubst %.l, %.c, $(wildcard *.l))

%.c: %.l
    $(LEX) $(LFLAGS) -o $@ $<

.PRECIOUS: $(TARGET) $(OBJECTS)

$(TARGET): $(OBJECTS)
    $(CC) $(OBJECTS) $(CFLAGS) -o $@

linux: $(OBJECTS)
    gcc $(OBJECTS) $(CFLAGS) -lm -g -o $(TESTBUILD)

cleanall: clean uninstall

clean:
    -rm -f *.c
    -rm -f $(TARGET)
    -rm -f $(TESTBUILD)

uninstall:
    -rm -f $(INSTALLDIR)/$(TARGET)

install:
    cp -f $(TARGET) $(INSTALLDIR)

Answer 4

反转结构：读入文件，然后遍历规则，这样你只能在各行上进行匹配。

regex_rules=~/Documents/rulesfiles/regexrulefile.txt
for tmp in *.unique20gbfile.suffix; do
while read line ; do 
 while read rule
    # Each $line in the looped-through file contains a regex rule, e.g.,
    # egrep -i '(^| )justin ?bieber|(^| )selena ?gomez'
    # $rname is a unique rule name generated by a separate bash function
    # exported to the current shell.
        do
        cmd=" echo $line  | $rule  >> ~/outputdir/$tmp.$rname.filter.piped &"
        eval $cmd
    done < $regex_rules
done < $tmp

完成

此时虽然你可以/应该使用bash（或perl＆＃39; s）内置的正则表达式匹配，而不是让它为每个匹配启动一个单独的egrep进程。您也可以拆分文件并运行并行进程。（注意我也更正了＆gt;到＆gt;＆gt;）

Answer 5

我还决定回到这里写一个perl版本，然后才注意到amon已经完成了它。既然它已经写好了，这是我的：

#!/usr/bin/perl -W
use strict;

# The search spec file consists of lines arranged in pairs like this:
# file1
# [Ff]oo
# file2
# [Bb]ar
# The first line of each pair is an output file. The second line is a perl
# regular expression. Every line of the input file is tested against all of
# the regular expressions, so an input line can end up in more than one
# output file if it matches more than one of them.

sub usage
{
        die "Usage: $0 search_spec_file [inputfile...]\n";
}

@ARGV or usage();

my @spec;

my $specfile = shift();
open my $spec, '<', $specfile or die "$specfile: $!\n";
while(<$spec>) {
        chomp;
        my $outfile = $_;
        my $regexp = <$spec>;
        chomp $regexp;
        defined($regexp) or die "$specfile: Invalid: Odd number of lines\n";
        open my $out, '>', $outfile or die "$outfile: $!\n";
        push @spec, [$out, qr/$regexp/];
}
close $spec;

while(<>) {
        for my $spec (@spec) {
                my ($out, $regexp) = @$spec;
                print $out $_ if /$regexp/;
        }
}

在bash中使用20g文件

5 个答案: