Question

I have two files, wordlist.txt and text.txt.

The first file, wordlist.txt, contains a huge list of Chinese, Japanese, and Korean words, e.g.:

你
你们
我

The second file, text.txt, contains long passages, e.g.:

你们要去哪里？
卡拉OK好不好？

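I want to create a new word list (wordsfound.txt), but it should contain only those lines from wordlist.txt that are found at least once in text.txt. For the example above, the output file should show: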

你
你们

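"我" is not in this list because it is never found in text.txt.

I want to find a very fast way to create this list, which contains only those lines of the first file that are found in the second.

I know a simple way in Bash to check each line of wordlist.txt and use grep to see whether it occurs in text.txt: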

a=1
while IFS= read -r line
do
    # Count the lines of text.txt that contain the word
    c=$(grep -c -- "$line" text.txt)
    if [ "$c" -ge 1 ]
    then
        echo "$line" >> wordsfound.txt
        echo "Found $a"
    else
        echo "Not found $a"
    fi
    a=$((a + 1))
done < wordlist.txt

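Unfortunately, since wordlist.txt is a very long list, this process takes a very long time. There must be a faster solution. Here is one consideration:

Since the files contain CJK characters, they can be thought of as using one giant alphabet of about 8,000 letters, so nearly every word shares characters with other words. E.g.: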

我
我们

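Because of this, if "我" is never found in text.txt, then "我们" can never appear either. A faster script might check "我" first and, on finding that it is absent, skip every subsequent word in wordlist.txt that contains it. If there are about 8,000 unique characters in wordlist.txt, the script should not need to check nearly as many lines.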
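A minimal Ruby sketch of this pruning idea, assuming both files fit in memory. The method name words_found and the inline sample data are made up for illustration; real input would be read from wordlist.txt and text.txt.

```ruby
require 'set'

# Keep only the words that occur in the text. A word is rejected cheaply
# as soon as one of its characters is missing from the text; only the
# survivors get a real substring search.
def words_found(words, text)
  present = text.each_char.to_set   # every distinct character in the text
  words.select do |word|
    next false if word.empty?
    word.each_char.all? { |ch| present.include?(ch) } && text.include?(word)
  end
end

words = %w[你 你们 我 我们]
text  = "你们要去哪里？\n卡拉OK好不好？\n"
puts words_found(words, text)   # prints 你 and 你们, one per line
```

Whether the character check pays off depends on how many words contain a character that is absent from the text, so it is worth timing on the real data.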

What is the fastest way to create a list containing only those words from the first file that are also found in the second?

Answer 1

I grabbed the text of War and Peace from Project Gutenberg and wrote the following script. It prints every word in /usr/share/dict/words that also appears in war_and_peace.txt. You can change the inputs with:

perl findwords.pl --wordlist=/path/to/wordlist --text=/path/to/text > wordsfound.txt

On my machine, it takes about one second to run.

use strict;
use warnings;
use utf8::all;

use Getopt::Long;

my $wordlist = '/usr/share/dict/words';
my $text     = 'war_and_peace.txt';

GetOptions(
    "wordlist=s" => \$wordlist,
    "text=s"    => \$text,
);

open my $text_fh, '<', $text
    or die "Cannot open '$text' for reading: $!";

my %is_in_text;
while ( my $line = <$text_fh> ) {
    chomp($line);

    # you will want to customize this line
    my @words = grep { $_ } split /[[:punct:][:space:]]/ => $line;
    next unless @words;

    # This beasty uses the 'x' builtin in list context to assign
    # the value of 1 to all keys (the words)
    @is_in_text{@words} = (1) x @words;
}

open my $wordlist_fh, '<', $wordlist
    or die "Cannot open '$wordlist' for reading: $!";

while ( my $word = <$wordlist_fh> ) {
    chomp($word);
    if ( $is_in_text{$word} ) {
        print "$word\n";
    }
}

Here are my timings:

• [ovid] $ wc -w war_and_peace.txt 
565450 war_and_peace.txt
• [ovid] $ time perl findwords.pl > wordsfound.txt 

real    0m1.081s
user    0m1.076s
sys 0m0.000s
• [ovid] $ wc -w wordsfound.txt 
15277 wordsfound.txt

Answer 2

Just use comm

http://unstableme.blogspot.com/2009/08/linux-comm-command-brief-tutorial.html

comm -12 <(sort wordlist.txt) <(sort text.txt)

Answer 3

This might work for you:

 tr '[:punct:]' ' ' < text.txt | tr -s ' ' '\n' |sort -u | grep -f - wordlist.txt

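Basically, create a new word list from text.txt and grep it against the wordlist.txt file.

N.B. You may want to use the software that was used to build the original wordlist.txt. In that case, all you need is: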

yoursoftware < text.txt > newwordlist.txt
grep -f newwordlist.txt wordlist.txt

Answer 4

Use grep with fixed-string (-F) semantics; this will be fastest. Similarly, if you want to write it in Perl, use the index function instead of a regular expression.

sort -u wordlist.txt > wordlist-unique.txt
grep -F -f wordlist-unique.txt text.txt

I'm surprised that there are already four answers and nobody has posted this one yet. People just don't know their toolbox anymore.

Answer 5

Surely not the fastest solution, but at least a working one (I hope).

This solution needs Ruby 1.9, and the text files should be UTF-8.

#encoding: utf-8
#Get test data
$wordlist = File.readlines('wordlist.txt', :encoding => 'utf-8').map{|x| x.strip}
$txt = File.read('text.txt', :encoding => 'utf-8')

new_wordlist = []
$wordlist.each{|word|
  new_wordlist << word if $txt.include?(word)
}

#Save the result
File.open('wordlist_new.txt', 'w:utf-8'){|f|
  f << new_wordlist.join("\n")
}

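Can you provide a bigger example so the different approaches can be benchmarked? (Perhaps some test files to download?)

Below is a benchmark with four approaches.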

#encoding: utf-8
require 'benchmark'
N = 10_000 #Number of Test loops

#Get test data
$wordlist = File.readlines('wordlist.txt', :encoding => 'utf-8').map{|x| x.strip}
$txt = File.read('text.txt', :encoding => 'utf-8')

def solution_count
    new_wordlist = []
    $wordlist.each{|word|
      new_wordlist << word if $txt.count(word) > 0
    }
    new_wordlist.sort
end

#Faster than count, it can stop after the first hit
def solution_include
    new_wordlist = []
    $wordlist.each{|word|
      new_wordlist << word if $txt.include?(word)
    }
    new_wordlist.sort
end
def solution_combine()
    #get biggest word size
    max = 0
    $wordlist.each{|word| max = word.size if word.size > max }
    #Build list of all letter combination from text
    words_in_txt = []
    0.upto($txt.size){|i|
      1.upto(max){|l|
        words_in_txt << $txt[i,l]
      }
    }
    (words_in_txt & $wordlist).sort
end
#Idea behind:
#- remove string if found.
#- the next comparison is faster, the search text is shorter.
#
#This will not work with overlapping words.
#Example:
#  abcdef contains def.
#  if we check bcd first, the 'd' of def will be deleted, def is not detected.
def solution_gsub
    new_wordlist = []
    txt = $txt.dup  #avoid to manipulate data source for other methods
    #We must start with the big words.
    #If we start with small one, we destroy  long words
    $wordlist.sort_by{|x| x.size }.reverse.each{|word|
      new_wordlist << word if txt.gsub!(word,'')
    }
    #Now we must add words which were already part of longer words
    new_wordlist.dup.each{|neww|
      $wordlist.each{|word|          
        new_wordlist << word if word != neww and neww.include?(word)
      }
    }
    new_wordlist.sort
end

#Save the result
File.open('wordlist_new.txt', 'w:utf-8'){|f|
  #~ f << solution_include.join("\n")
  f << solution_combine.join("\n")
}

#Check the different results
if solution_count != solution_include
  puts "Difference solution_count <> solution_include"
end
if solution_gsub != solution_include
  puts "Difference solution_gsub <> solution_include"
end
if solution_combine != solution_include
  puts "Difference solution_combine <> solution_include"
end

#Benchmark the solution
Benchmark.bmbm(10) {|b|

  b.report('count') { N.times { solution_count } }
  b.report('include') { N.times { solution_include } }
  b.report('gsub') { N.times { solution_gsub } } #wrong results
  b.report('combine') { N.times { solution_combine } }

} #Benchmark

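I think the solution_gsub variant is not correct. See the comment in the method definition. If CJK happens to allow this solution, please give me feedback. This variant is the slowest in my test, but it may do better with bigger examples, and perhaps it can be tuned a bit.

The combine variant is also very slow, but it would be interesting to see what happens with a bigger example.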
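One possible reason combine is slow is that it collects every substring in an Array, duplicates included, before intersecting. A hedged sketch of the same idea with a Set, restricted to the word lengths that actually occur in the word list, is below. The name solution_ngram_set is hypothetical, it takes its inputs as arguments rather than from the globals, and it has not been benchmarked against the variants above.

```ruby
require 'set'

# Collect every substring of the text whose length matches some word
# length, then keep the words that are in that set.
def solution_ngram_set(wordlist, txt)
  lengths = wordlist.map(&:size).uniq   # only lengths that occur in the list
  seen = Set.new
  lengths.each do |len|
    next if len.zero? || len > txt.size
    0.upto(txt.size - len) { |i| seen << txt[i, len] }
  end
  wordlist.select { |w| seen.include?(w) }.uniq.sort
end

puts solution_ngram_set(%w[你 你们 我], "你们要去哪里？")
```

Memory use grows with the text length times the number of distinct word lengths, so this only makes sense when the longest word is short.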

Answer 6

I would probably use Perl:

use strict;

my @aWordList = ();

open(WORDLIST, "< wordlist.txt") || die("Can't open wordlist.txt");

while(my $sWord = <WORDLIST>)
{
   chomp($sWord);
   push(@aWordList, $sWord);
}

close(WORDLIST);

open(TEXT, "< text.txt") || die("Can't open text.txt");

while(my $sText = <TEXT>)
{
   foreach my $sWord (@aWordList)
   {
      if($sText =~ /\Q$sWord\E/)  # \Q..\E: match the word as a literal string
      {
          print("$sWord\n");
      }
   }
}


close(TEXT);

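This won't be too slow, but if you can let us know the size of the files you are dealing with, I could write something smarter using hash tables.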
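For what it's worth, one way a hash table could help is sketched below in Ruby. This is only a guess at what "smarter" might mean, not the answerer's code: words are indexed by their first character, and the text is scanned once, testing at each position only the words that start with the character found there. The name find_words_by_first_char is made up.

```ruby
require 'set'

def find_words_by_first_char(words, text)
  # Hash table: first character => all words starting with it.
  index = Hash.new { |h, k| h[k] = [] }
  words.each { |w| index[w[0]] << w unless w.empty? }

  found = Set.new
  chars = text.chars
  chars.each_index do |i|
    candidates = index.fetch(chars[i], nil)   # fetch avoids creating buckets
    next unless candidates
    candidates.each do |w|
      found << w if chars[i, w.size].join == w
    end
  end
  words.select { |w| found.include?(w) }      # keep word-list order
end

puts find_words_by_first_char(%w[你 你们 我], "你们要去哪里？")
```

Unlike the nested loop above, each word is reported once, and overlapping words such as bcd and def in abcdef are both found.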

Answer 7

First, a TXR Lisp solution (http://www.nongnu.org/txr):

(defvar tg-hash (hash)) ;; tg == "trigraph"

(unless (= (len *args*) 2)
  (put-line `arguments required: <wordfile> <textfile>`)
  (exit nil))

(defvar wordfile [*args* 0])

(defvar textfile [*args* 1])

(mapcar (lambda (line)
          (dotimes (i (len line))
            (push line [tg-hash [line i..(succ i)]])
            (push line [tg-hash [line i..(ssucc i)]])
            (push line [tg-hash [line i..(sssucc i)]])))
        (file-get-lines textfile))

(mapcar (lambda (word)
          (if (< (len word) 4)
            (if [tg-hash word]
              (put-line word))
            (if (find word [tg-hash [word 0..3]]
                      (op search-str @2 @1))
              (put-line word))))
        (file-get-lines wordfile))

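The strategy here is to reduce the text corpus to a hash table indexed on the individual characters, digraphs, and trigraphs occurring in its lines, associating these fragments with the lines. Then, when we process the word list, this reduces the search effort.

First, if the word is short, three characters or fewer (probably common in Chinese words), we can try to get an instant match in the hash table. If there is no match, the word is not in the corpus.

If the word is longer than three characters, we can try to match its first three characters. That gives us a list of lines that contain a match for the trigraph. We can search those lines exhaustively to see which of them match the word. I suspect this greatly reduces the number of lines that have to be searched.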
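For readers who do not know TXR, here is a rough Ruby sketch of the same strategy. It is an illustration written for this explanation, not a translation verified against the TXR code; build_fragment_index and word_in_corpus? are hypothetical names, and the sample data is taken from the run shown further down.

```ruby
# Map every 1-, 2- and 3-character fragment of each line to the lines
# that contain it.
def build_fragment_index(lines)
  index = Hash.new { |h, k| h[k] = [] }
  lines.each do |line|
    chars = line.chars
    chars.each_index do |i|
      1.upto(3) do |n|
        frag = chars[i, n].join
        next if frag.size < n                 # ran off the end of the line
        bucket = index[frag]
        bucket << line unless bucket.last.equal?(line)  # one entry per line
      end
    end
  end
  index
end

# Short words are answered by the hash alone; longer words are searched
# only in the lines that contain their first three characters.
def word_in_corpus?(index, word)
  return false if word.empty?
  return index.key?(word) if word.size <= 3
  index.fetch(word[0, 3], []).any? { |line| line.include?(word) }
end

lines = ["Long ago people", "believed that the four", "elements were",
         "just", "water", "fire", "earth"]
index = build_fragment_index(lines)
puts %w[water fire earth the it].select { |w| word_in_corpus?(index, w) }
```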

I would need your data, or a representative sample of it, to see what the behavior is like.

Sample run:

$ txr words.tl words.txt text.txt
water
fire
earth
the

$ cat words.txt
water
fire
earth
the
it

$ cat text.txt
Long ago people
believed that the four
elements were
just
water
fire
earth

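(TXR reads UTF-8 and does all string manipulation in Unicode, so testing with ASCII characters is valid.)

The use of lazy lists means that we do not store the entire list of, say, 300,000 words. Although we are using the Lisp mapcar function, the list is generated on the fly, and because we do not keep a reference to the head of the list, it is eligible for garbage collection.

Unfortunately, we do have to keep the text corpus in memory, because the hash table associates lines.

If that is a problem, the solution can be reversed: scan all the words, then process the text corpus lazily, marking the words that occur, and finally eliminate the rest. I will post such a solution too.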
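A minimal Ruby sketch of that reversed approach, assuming the word list fits in memory while the text is streamed one line at a time. The name mark_words is made up, and a StringIO stands in for the real file so that the example runs on its own.

```ruby
require 'set'
require 'stringio'

# Hold the words in memory, read the text line by line, and move each
# word to `found` the first time a line contains it.
def mark_words(words, text_io)
  remaining = words.reject(&:empty?).uniq
  found = Set.new
  text_io.each_line do |line|
    break if remaining.empty?                 # nothing left to look for
    hit, remaining = remaining.partition { |w| line.include?(w) }
    found.merge(hit)
  end
  words.select { |w| found.include?(w) }
end

text_io = StringIO.new("你们要去哪里？\n卡拉OK好不好？\n")
puts mark_words(%w[你 你们 我], text_io)
```

With real data the second argument could be an open file, for example File.open('text.txt', 'r:utf-8') { |f| mark_words(words, f) }. Words already found are never searched for again, so later lines get cheaper.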

Answer 8

new file newlist.txt
for each word in wordlist.txt:
    check if word is in text.txt (I would use grep, if you're willing to use bash)
    if yes:
        append it to newlist.txt (probably echo word >> newlist.txt)
    if no:
        next word

Answer 9

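The simplest way with a bash script:

First preprocess the text with "tr" and "sort" to format it as one word per line and remove duplicate lines.
Then do this:

cat wordlist.txt | while read i; do grep -E "^$i$" text.txt; done;

That's the list of words you want...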

Answer 10

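Try this:

cat wordlist.txt | while read line; do if [[ $(grep -wc "$line" text.txt) -gt 0 ]]; then echo "$line"; fi; done

Whatever you do, if you use grep you must use -w to match whole words. Otherwise, if you have foo in wordlist.txt and foobar in text.txt, you will get a wrong match.

If the files are very big and this loop takes too long to run, you can convert text.txt to a list of words (easy with AWK) and use comm to find the words that are in both lists.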
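The "convert the text to a list of words, then intersect" idea can be sketched in Ruby as follows. This assumes the text separates words with spaces or punctuation, which is generally not true of Chinese or Japanese text, so treat it as an illustration of the foo/foobar point rather than a fix for the CJK case. The name whole_words_found is hypothetical.

```ruby
require 'set'

# Split the text into maximal runs of word characters, then keep the
# words from the list that appear as complete tokens.
def whole_words_found(words, text)
  tokens = text.scan(/[[:word:]]+/).to_set
  words.select { |w| tokens.include?(w) }
end

puts whole_words_found(%w[foo bar], "foobar baz bar.")   # prints bar
```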

Answer 11

This solution is in Perl, maintains your original semantics, and uses the optimization you suggested.

#!/usr/bin/perl
@list=split("\n",`sort < ./wordlist.txt | uniq`);
$size=scalar(@list);
for ($i=0;$i<$size;++$i) { $list[$i]=quotemeta($list[$i]);}
for ($i=0;$i<$size;++$i) {
    my $j = $i+1;
    while ($list[$j]=~/^$list[$i]/) {
            ++$j;
    }
    $skip[$i]=($j-$i-1);
}
open IN,"<./text.txt" or die;
@text = (<IN>);
close IN;
foreach $c(@text) {
    for ($i=0;$i<$size;++$i) {
            if ($c=~/$list[$i]/) {
                    $found{$list[$i]}=1;
                    last;
            }
            else {
                    $i+=$skip[$i];
            }
    }
}
open OUT,">wordsfound.txt" or die;
while ( my ($key, $value) = each(%found) ) {
        print OUT "$key\n";
}
close OUT;
exit;
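For comparison, the skip optimization on its own can be sketched in a few lines of Ruby. This is a simplified illustration, not a port of the Perl above: it searches the whole text at once instead of line by line, and the name find_with_prefix_skip is made up. It relies on the fact that, in a sorted list, all words starting with a given word sit directly after it.

```ruby
# Walk the sorted, de-duplicated word list. When a word is absent from
# the text, every following word that starts with it is skipped without
# being searched, because it contains the absent word.
def find_with_prefix_skip(words, text)
  sorted = words.reject(&:empty?).uniq.sort
  found = []
  i = 0
  while i < sorted.size
    word = sorted[i]
    i += 1
    if text.include?(word)
      found << word
    else
      i += 1 while i < sorted.size && sorted[i].start_with?(word)
    end
  end
  found
end

puts find_with_prefix_skip(%w[我们 你们 我 你 我们的], "你们要去哪里？")
```

Here 我 is not in the text, so 我们 and 我们的 are skipped. The skip only covers words that begin with the missing word; words that contain it elsewhere are still searched.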

Answer 12

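Use parallel processing to speed things up.

1) Run sort & uniq on wordlist.txt, then split it into several files (X). Do some testing; X should match the number of cores in your computer.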

 split -d -l wordlist.txt

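2) Use xargs -P X -n 1 script.sh x00 > output-x00.txt to process the files in parallel

 find ./splitted_files_dir -type f -name "x*" -print| xargs -P 20 -n 1 -I SPLITTED_FILE script.sh SPLITTED_FILE

3) cat output* > output.txt to concatenate the output files

This will speed up the processing, and you can use tools that you understand, which keeps the main "cost" down.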
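The same split, process, and concatenate steps can be sketched inside one Ruby program. This is an assumption-laden illustration: it uses fork, so it only runs on platforms that support it (not Windows), and the name parallel_find and the workers parameter are made up. Each child process searches one slice of the word list and sends its hits back through a pipe.

```ruby
def parallel_find(words, text, workers: 4)
  slice_size = [(words.size / workers.to_f).ceil, 1].max
  jobs = words.each_slice(slice_size).map do |chunk|
    rd, wr = IO.pipe
    rd.binmode
    wr.binmode
    pid = fork do
      rd.close
      # Child: search this slice only and send the hits to the parent.
      wr.write(Marshal.dump(chunk.select { |w| text.include?(w) }))
      wr.close
      exit!(0)
    end
    wr.close
    [pid, rd]
  end
  # Parent: collect the partial results in slice order.
  jobs.flat_map do |pid, rd|
    hits = Marshal.load(rd.read)
    rd.close
    Process.wait(pid)
    hits
  end
end

puts parallel_find(%w[你 你们 我 我们], "你们要去哪里？", workers: 2)
```

Plain Ruby threads would not help much here on the standard interpreter, since the substring search is CPU-bound; that is why the sketch uses processes.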

The script is almost identical to the one you used in the first place.

script.sh
FILE=$1                                     # one chunk of the split word list
OUTPUTFILE="output-$(basename "$FILE").txt"
TEXT="text.txt"
a=1
while IFS= read -r line
do
    # Count the lines of the text that contain the word
    c=$(grep -c -- "$line" "$TEXT")
    if [ "$c" -ge 1 ]
    then
        echo "$line" >> "$OUTPUTFILE"
        echo "Found $a"
    else
        echo "Not found $a"
    fi
    a=$((a + 1))
done < "$FILE"
