Question

I have two files, for example:

File1:

partial
line3
someline2

File2:

this is line3
this is partial
typo artial
someline2
someline

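Requirements:

Remove every line in file2 that contains any line from file1.
Partial matches must count: a line of file1 found anywhere inside a line of file2 (not a whole-line match).
I am looking for the most efficient approach; the files I am comparing have millions of lines.
Any tool or language available on Linux can be used.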

Expected result:

typo artial
someline

I tested this with Python, but it was very slow. I also tested with grep, and it was almost as slow as Python.

The files I am comparing can be up to 10 GB in size. Memory on the server is not a problem, but I do not want to waste resources.

Test results based on the answers:
Files used for testing:

file1 with 7051 lines
file2 with 2182387 lines

Using grep:

# time grep -v -f file1 file2 > file3
real    28m50.078s
user    27m13.984s
sys     1m36.068s
# wc -l file3
1947790 file3

Using grep with -F:

# time grep -v -F -f file1 file2 > file3
real    0m1.441s
user    0m1.400s
sys     0m0.040s
# wc -l file3
1950655 file3

Using the Perl script posted by Borodin:

# time ./clean.pl > file3
real    0m2.281s
user    0m2.176s
sys     0m0.104s
# wc -l file3
1950655 file3

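Honestly, I did not expect fixed strings to make such a big difference for grep. So far grep wins this one. I still have to test and time it on the 10 GB files, after making sure the results are correct. I will come back with an update.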
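
The two grep runs above also report different line counts (1947790 without -F, 1950655 with it). One possible reason, assuming file1 contains regex metacharacters such as dots, is that without -F each line of file1 is read as a regular expression and can match more than its literal text. A minimal sketch with hypothetical data (the file names patterns and data are made up for illustration):

```shell
# Work in a throwaway directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical data: the pattern contains a dot, a regex metacharacter.
printf '%s\n' 'a.c' > patterns
printf '%s\n' 'xx a.c xx' 'xx abc xx' 'xx def xx' > data

# Regex mode: "." matches any character, so the "abc" line is removed too.
regex_result=$(grep -v -f patterns data)

# Fixed-string mode: only the line with the literal text "a.c" is removed.
fixed_result=$(grep -v -F -f patterns data)

printf 'regex:\n%s\n\nfixed:\n%s\n' "$regex_result" "$fixed_result"
```

With this data the regex run keeps only "xx def xx", while the fixed-string run keeps both "xx abc xx" and "xx def xx".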

Update

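Perl wins this one, because I had to introduce some regular expressions for a few special cases. For example, I have a large file of domain names that I want to exclude from other files. That means I need domain$ as a regular expression; otherwise google.co would match google.com, which is not right. If you do not have special cases like the ones I have for some files, grep is the clear performance winner.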
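
The google.co case can be sketched in shell as well. This is only an illustration under stated assumptions, not the script that was actually used: the file names domains, data and anchored are hypothetical, each data line is assumed to end with a domain, and the domains are assumed to contain only letters, digits, hyphens and dots, so the dot is the only character that needs escaping.

```shell
# Work in a throwaway directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical data: one domain to exclude, and lines ending with a domain.
printf '%s\n' 'google.co' > domains
printf '%s\n' 'mail google.co' 'mail google.com' 'mail example.org' > data

# Fixed strings: "google.co" is a substring of "google.com",
# so both google lines are removed.
fixed_result=$(grep -v -F -f domains data)

# Escape the dots and append "$" so each pattern only matches at end of line.
sed -e 's/\./\\./g' -e 's/$/$/' domains > anchored

# Regex mode with the anchored patterns: only "google.co" itself is removed.
anchored_result=$(grep -v -f anchored data)

printf 'fixed:\n%s\n\nanchored:\n%s\n' "$fixed_result" "$anchored_result"
```

Here the fixed-string run keeps only "mail example.org", while the anchored run also keeps "mail google.com". The anchored patterns are regular expressions again, so the -F speed-up measured above no longer applies.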

Answer 1

I would use the grep command on a Linux system.

Command

grep -v -f File1 File2

-v: select non-matching lines

-f: obtain PATTERN from FILE

Run the above command in a terminal.

Output

typo artial
someline
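
To try the command on the sample data from the question, a small self-contained sketch follows. The temporary directory is an assumption made here so that no existing files are overwritten.

```shell
# Work in a throwaway directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data from the question.
printf '%s\n' 'partial' 'line3' 'someline2' > File1
printf '%s\n' 'this is line3' 'this is partial' 'typo artial' 'someline2' 'someline' > File2

# -v inverts the match; -f reads one pattern per line from File1.
result=$(grep -v -f File1 File2)
printf '%s\n' "$result"
```

This prints the two lines shown above. Adding -F treats the lines of File1 as fixed strings instead of regular expressions, which is the variant the question's timings found much faster.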

Answer 2

The simplest way is to build a regex pattern from all the strings in file1.txt, and print only those lines in file2.txt that don't match the pattern.

use strict;
use warnings 'all';

my $re = do {
    open my $fh, '<', 'file1.txt' or die $!;
    my @data = <$fh>;
    chomp @data;
    my $re = join '|', map quotemeta($_), @data;
    qr/$re/;
};

open my $fh, '<', 'file2.txt' or die $!;
/$re/ or print while <$fh>;

Output

typo artial
someline

Answer 3

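Using a hash is a good way to search for strings within a line, and it will speed up your program. Try it this way and see how fast your program runs. I believe this will help you.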

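use strict;
use warnings;

# Load every line of file1 as a hash key for constant-time lookup.
my $filename1 = "file1";
my %myhash;
open my $fh1, '<', $filename1 or die "Cannot open $filename1: $!\n";
while (my $key = <$fh1>) {
    chomp $key;
    $myhash{$key} = 1;
}
close $fh1;

# Print each line of file2 unless one of its whitespace-separated
# words is a key in the hash. Only whole words are compared.
my $filename2 = "file2";
open my $fh2, '<', $filename2 or die "Cannot open $filename2: $!\n";
while (my $line = <$fh2>) {
    chomp $line;
    my $flag = 0;
    foreach my $id (split /\s/, $line) {
        if (exists $myhash{$id}) {
            $flag = 1;
            last;    # one hit is enough to drop the line
        }
    }
    print "$line\n" if $flag == 0;
}
close $fh2;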
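
Because the hash lookup compares whole whitespace-separated words, it will not catch a file1 line that appears inside a longer word, which the question's partial-match requirement asks for. A rough way to see the difference is grep's -w option, used here only as an approximate analogue of the hash approach (-w uses word-character boundaries, not whitespace splitting); the data and file names are hypothetical.

```shell
# Work in a throwaway directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical data: "partial" also occurs inside the word "impartial".
printf '%s\n' 'partial' > patterns
printf '%s\n' 'this is partial' 'this is impartial' 'other' > data

# Substring matching: both lines containing "partial" are removed.
substring_result=$(grep -v -F -f patterns data)

# Whole-word matching: "impartial" is not the word "partial", so it stays.
word_result=$(grep -v -w -F -f patterns data)

printf 'substring:\n%s\n\nwhole word:\n%s\n' "$substring_result" "$word_result"
```

The substring run keeps only "other", while the whole-word run also keeps "this is impartial".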

Efficient way to compare two files and remove partial matches
