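I am working with large files. In your opinion, what is the best way to process large files when you want to know whether a word "x" from $file1 occurs in a sentence "y" of file2? My files have more than 20,000 lines.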
Example:
Here is the content of the first file:
eat
take
breath
you
alpha
Here is the content of the second file:
eat,hungry
love,lovers
me,mine
take,taken,give
you,u,yo
fun,funny
Here is the content I would expect in the third (output) file:
eat : eat,hungry
take : take,taken,give
you : you,u,yo
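As you can see, for each word in the first file I want to find the matching expressions in the second file.
My solutions so far (but the loops never seem to finish):
Solution 1: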
$file1= "words.txt";
$file2 = "expressions.txt";
$out = "out.txt";
open (W, "<", $file1);
open (E, "<", $file2);
open (OUT, ">", $out);
while(defined($l = <W>)){
@a = split (/\n/, $l);
push @w, @a;
}
while(defined($l2 = <E>)){
for ($i = 0; $i < @w; $i++){
if (grep /\Q\b$w[$i]\b\E/, $l2){ #or just /\b$w[$i]\b/
print OUT "$w[$i] : $l2\n";
}
}
}
Solution 2:
$file1= "words.txt";
$file2 = "expressions.txt";
$out = "out.txt";
open (W, "<", $file1);
open (E, "<", $file2);
open (OUT, ">", $out);
while(defined($l = <W>)){
@a = split (/\n/, $l);
push @w, @a;
while(defined($l2 = <E>)){
@b = split (/\n/, $l2);
push @e, @b;
}
for ($k = 0; $k < @e; $k++){
for ($i = 0; $i < @w; $i++){
if (grep /\b$w[$i]\b/, $e[$k]){
print OUT "$w[$i] : $w[$l]\n";
}
}
}
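Answer 0 (score: 1)
How about processing the expressions file first to build a dictionary that maps each word to the sentence it appears in, and then checking whether each word from words.txt is in that dictionary? I think this should be faster, because every word then costs one hash lookup instead of a scan over all sentences. The source code is as follows: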
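#! /opt/VRTSperl/bin/perl
use strict;
use warnings;

my $words       = "words.txt";
my $expressions = "expressions.txt";
my $out         = "out.txt";

open(my $e_fh,   "<", $expressions) or die "Cannot open $expressions: $!";
open(my $w_fh,   "<", $words)       or die "Cannot open $words: $!";
open(my $out_fh, ">", $out)         or die "Cannot open $out: $!";

# Map every word to the list of sentences that contain it.
my %dic;
while (my $sentence = <$e_fh>) {
    chomp($sentence);
    foreach my $word (split /,/, $sentence) {
        push @{ $dic{$word} }, $sentence;
    }
}

# One hash lookup per word; print one output line per matching sentence.
while (my $word = <$w_fh>) {
    chomp($word);
    next unless exists $dic{$word};
    print {$out_fh} "$word : $_\n" for @{ $dic{$word} };
}
close($out_fh) or die "Cannot close $out: $!";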
Answer 1 (score: 1)
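#!/usr/local/bin/perl
use strict;
use warnings;

open(my $fh,  "<", "f1.txt") or die $!;
open(my $fh2, "<", "f2.txt") or die $!;

# Read the line first, then chomp: chomp returns the number of
# characters removed, so it must not be used as the loop condition.
my @keys;
while (my $line = <$fh>) {
    chomp $line;
    push @keys, $line;
}

while (my $line2 = <$fh2>) {
    chomp $line2;
    foreach my $key (@keys) {
        # \Q...\E quotes metacharacters; \b stops "eat" matching inside "breath".
        if ($line2 =~ /\b\Q$key\E\b/) {
            print "$key : $line2\n";
        }
    }
}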
Output
eat : eat,hungry
take : take,taken,give
you : you,u,yo
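Answer 2 (score: 1)
You are trying to match a literal \b
rather than a word boundary, because \Q also quotes the backslash of \b. So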
/\Q\b$w[$i]\b\E/
should actually be
/\b\Q$w[$i]\E\b/
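To make the difference concrete, here is a minimal sketch that runs both patterns against one line of the sample data. The helper names matches_word and matches_quoted_boundary are hypothetical, introduced only for this illustration; they are not part of the question's code.

```perl
use strict;
use warnings;

# \b sits outside \Q...\E, so it is a real word boundary;
# only the interpolated word itself is quoted.
sub matches_word {
    my ($word, $line) = @_;
    return $line =~ /\b\Q$word\E\b/ ? 1 : 0;
}

# \Q comes first, as in the question, so the backslashes of \b are
# quoted and the pattern looks for a literal backslash followed by "b".
sub matches_quoted_boundary {
    my ($word, $line) = @_;
    return $line =~ /\Q\b$word\b\E/ ? 1 : 0;
}

my $line = "take,taken,give";
print "boundary outside \\Q: ", matches_word("take", $line), "\n";
print "boundary inside \\Q:  ", matches_quoted_boundary("take", $line), "\n";
```

With the boundary outside the quoted part the word is found; with \Q in front, the sample line can never match, since it contains no backslash.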