Question

I'm trying to use a list of hundreds of common misspellings to clean up some input before searching it for duplicates.

This is a time-critical process, so I'm hoping there is a faster way than running hundreds of regexes (or one regex with hundreds of branches).

Is there an efficient way to perform hundreds of text substitutions in Ruby?

Answer 1

An alternative approach, if your input data is delimited words, is simply to build a hash table of {error => correction}.

Hash table lookup is fast, so if you can bend your input data into this format, it will almost certainly be fast enough.
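A minimal sketch of that idea, assuming the input can be split on whitespace. The `CORRECTIONS` table and the `correct_words` helper are hypothetical names introduced only for illustration, and the three entries are sample data:

```ruby
# Hypothetical sample table; a real one would hold the full misspelling list.
CORRECTIONS = {
  'abandonned' => 'abandoned',
  'abilty'     => 'ability',
  'abotu'      => 'about'
}.freeze

# Split on whitespace and swap each word for its correction, if there is one.
# Words that are not in the table pass through unchanged.
def correct_words(text)
  text.split.map { |word| CORRECTIONS.fetch(word.downcase, word) }.join(' ')
end

puts correct_words('They abandonned the plan abotu noon')
# They abandoned the plan about noon
```

This version collapses runs of whitespace and does not strip punctuation attached to a word, so real input would probably need a tokenizing step first.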

Answer 2

I'm happy to say I just found "RegexpTrie", a usable replacement for Perl's Regexp::Assemble that removes the need to call out to Perl.

Install it and give it a try:

require 'regexp_trie'

foo = %w(miss misses missouri mississippi)

RegexpTrie.union(foo)
# => /miss(?:(?:es|ouri|issippi))?/

RegexpTrie.union(foo, option: Regexp::IGNORECASE)
# => /miss(?:(?:es|ouri|issippi))?/i

Here's a comparison of the outputs. The commented patterns inside the array are from Regexp::Assemble, and the trailing output is from RegexpTrie:

require 'regexp_trie'

[
  'how now brown cow',                           # /(?:[chn]ow|brown)/
  'the rain in spain stays mainly on the plain', # /(?:(?:(?:(?:pl|r)a)?i|o)n|s(?:pain|tays)|mainly|the)/
  'jackdaws love my giant sphinx of quartz',     # /(?:jackdaws|quartz|sphinx|giant|love|my|of)/
  'fu foo bar foobar',                           # /(?:f(?:oo(?:bar)?|u)|bar)/
  'ms miss misses missouri mississippi'          # /m(?:iss(?:(?:issipp|our)i|es)?|s)/
].each do |s|
  puts "%-43s # /%s/" % [s, RegexpTrie.union(s.split).source]
end

# >> how now brown cow                           # /(?:how|now|brown|cow)/
# >> the rain in spain stays mainly on the plain # /(?:the|rain|in|s(?:pain|tays)|mainly|on|plain)/
# >> jackdaws love my giant sphinx of quartz     # /(?:jackdaws|love|my|giant|sphinx|of|quartz)/
# >> fu foo bar foobar                           # /(?:f(?:oo(?:bar)?|u)|bar)/
# >> ms miss misses missouri mississippi         # /m(?:iss(?:(?:es|ouri|issippi))?|s)/

Regarding how to use the Wikipedia link and the misspelled words:

require 'nokogiri'
require 'open-uri'
require 'regexp_trie'

URL = 'https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines'

# With open-uri, Kernel#open no longer opens URLs on Ruby 3.0+; use URI.open.
doc = Nokogiri::HTML(URI.open(URL))
corrections = doc.at('div#mw-content-text pre').text.lines[1..-1].map { |s|
  a, b = s.chomp.split('->', 2)
  [a, b.split(/,\s+/) ]
}.to_h
#  {"abandonned"=>["abandoned"],
#   "aberation"=>["aberration"],
#   "abilityes"=>["abilities"],
#   "abilties"=>["abilities"],
#   "abilty"=>["ability"],
#   "abondon"=>["abandon"],
#   "abbout"=>["about"],
#   "abotu"=>["about"],
#   "abouta"=>["about a"],
#   ...
#   }

misspelled_words_regex = /\b(?:#{RegexpTrie.union(corrections.keys, option: Regexp::IGNORECASE).source})\b/i
# => /\b(?:(?:a(?:b(?:andonned|eration|il(?:ityes|t(?:ies|y))|o(?:ndon(?:(?:ed|ing|s))?|tu|ut(?:it|the|a)...

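At this point you could use gsub(misspelled_words_regex, corrections); however, some of the values in corrections are arrays, because more than one word or phrase may be a valid replacement for a misspelled word. You'll have to do something to decide which of the choices to use.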
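One way to make that choice is to pass a block to gsub instead of the hash. The sketch below is only an illustration: it always takes the first listed correction, which is an assumption rather than a recommendation, and `SAMPLE_CORRECTIONS`, `SAMPLE_REGEX` and `fix_misspellings` are hypothetical stand-ins for the real table and regex built above.

```ruby
# Hypothetical sample data shaped like the corrections hash above.
SAMPLE_CORRECTIONS = {
  'abandonned' => ['abandoned'],
  'abotu'      => ['about'],
  'accension'  => ['accession', 'ascension']
}.freeze

# Plain Regexp.union keeps the sketch dependency-free; RegexpTrie.union
# could be used in the same place.
SAMPLE_REGEX = /\b(?:#{Regexp.union(SAMPLE_CORRECTIONS.keys).source})\b/i

def fix_misspellings(text, regex, corrections)
  text.gsub(regex) do |match|
    # The regex ignores case but the hash keys are lowercase, so normalize.
    candidates = corrections[match.downcase]
    # Assumption: take the first suggestion; leave the text alone if none.
    candidates ? candidates.first : match
  end
end

puts fix_misspellings('Abotu the accension', SAMPLE_REGEX, SAMPLE_CORRECTIONS)
# about the accession
```

Note that the original capitalization is lost ("Abotu" becomes "about"), which is harmless if the text is only being normalized for duplicate detection.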

Ruby is missing a really useful module found in Perl, called Regexp::Assemble. Python has hachoir-regex, which appears to do the same sort of thing.

Regexp::Assemble creates a very efficient regular expression from lists of words and simple expressions. It's really remarkable... or... diabolical?

Check out the module's example; it is very simple to use in its basic form:

use Regexp::Assemble;

my $ra = Regexp::Assemble->new;
$ra->add( 'ab+c' );
$ra->add( 'ab+-' );
$ra->add( 'a\w\d+' );
$ra->add( 'a\d+' );
print $ra->re; # prints a(?:\w?\d+|b+[-c])

Notice how it combines the patterns. It does the same with regular words, only more effectively, because common substrings get merged:

use Regexp::Assemble;

my $lorem = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.';

my $ra = Regexp::Assemble->new('flags' => 'i');

$lorem =~ s/[^a-zA-Z ]+//g;

$ra->add(split(' ', lc($lorem)));
print $ra->anchor_word(1)->as_string, "\n";

Which outputs:

\b(?:a(?:dipisicing|liqua|met)|(?:consectetu|tempo)r|do(?:lor(?:emagna)?)?|e(?:(?:li)?t|iusmod)|i(?:ncididunt|psum)|l(?:abore|orem)|s(?:ed|it)|ut)\b

This code ignores case and honors word boundaries.

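I'd recommend writing a little Perl app that takes a list of words and uses that module to output the stringified version of the regex pattern. You should be able to import that pattern into Ruby, which would let you find misspelled words very quickly. You could even have it write the pattern to a YAML file and then load that file in your Ruby code. Periodically parse the misspelled-word pages, run the output through the Perl code, and your Ruby code will have an up-to-date pattern.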

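You can run that pattern against a big chunk of text to see whether it contains any misspelled words. If it does, break the text down into sentences or words and check each piece against the regex again. Don't start by testing individual words, because most words will be spelled correctly. It's almost like a binary search over your text: test the whole thing and, if there is a hit, break it into smaller blocks to narrow the search until you've found the individual misspellings. How you break up the chunks depends on how much incoming text you have. A regex pattern can test an entire block of text and return nil or an index value just as it can for a single word, so you gain a lot of speed by testing the text in big chunks.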
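The narrowing strategy described above can be sketched roughly as follows. The tiny `MISSPELLED` pattern is a hypothetical stand-in for the large generated regex, the sentence splitting is deliberately naive, and `find_misspellings` is a name made up for this example:

```ruby
# Hypothetical stand-in for the large generated pattern.
MISSPELLED = /\b(?:abandonned|abilty|abotu)\b/i

# Test the whole block first; only split into sentences when there is a hit.
def find_misspellings(text)
  # One pass over the full text; clean input stops here.
  return [] unless text.match?(MISSPELLED)

  # Naive sentence split: break after ., ! or ? followed by whitespace.
  text.split(/(?<=[.!?])\s+/).flat_map do |sentence|
    # Skip sentences without a hit, scan the rest for each misspelling.
    sentence.match?(MISSPELLED) ? sentence.scan(MISSPELLED) : []
  end
end

p find_misspellings('All fine here. They abandonned it. Abotu time.')
# ["abandonned", "Abotu"]
p find_misspellings('All fine here.')
# []
```

Whether the intermediate sentence-level pass pays off depends on the size of the input; for short texts, scanning the whole block directly may be just as fast, so it is worth measuring.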

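Then, once you know you have a misspelled word, you can do a hash lookup for the correct spelling. It would be a big hash, but sifting the good spellings from the bad is what takes the longest. The lookup itself is extremely fast.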

Here's some example code:

get_words.rb

#!/usr/bin/env ruby

require 'open-uri'
require 'nokogiri'
require 'yaml'

words = {}
['0-9', *('A'..'Z').to_a].each do |l|
  begin
    print "Reading #{l}... "
    # With open-uri, Kernel#open no longer opens URLs on Ruby 3.0+; use URI.open.
    html = URI.open("http://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/#{l}").read
    puts 'ok'
  rescue StandardError => e
    puts "got \"#{e}\""
    next
  end

  doc = Nokogiri::HTML(html)
  doc.search('div#bodyContent > ul > li').each do |n|
    # Skip list items that do not look like "misspelling (correction)".
    next unless n.content =~ /^(\w+) \s+ \(([^)]+)/x
    words[$1] = $2
  end
end

File.open('wordlist.yaml', 'w') do |wordfile|
  wordfile.puts words.to_yaml
end

regex_assemble.pl

#!/usr/bin/env perl

use Regexp::Assemble;
use YAML;

use warnings;
use strict;

my $ra = Regexp::Assemble->new('flags' => 'i');

my %words = %{YAML::LoadFile('wordlist.yaml')};
$ra->add(map{ lc($_) } keys(%words));

print $ra->chomp(1)->anchor_word(1)->as_string, "\n";

Run the first script, then run the second one, piping its output to a file to capture the emitted regex.
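Loading the captured pattern back into Ruby might look something like the sketch below. It assumes the file holds a single pattern line whose syntax Ruby's regex engine also accepts; the `load_pattern` helper, the temporary file and the short sample pattern are all hypothetical, standing in for the real output of regex_assemble.pl.

```ruby
require 'tempfile'

# Read a stringified pattern from disk and compile it, ignoring case.
def load_pattern(path)
  Regexp.new(File.read(path).chomp, Regexp::IGNORECASE)
end

# Stand-in for the file written by the Perl script (sample pattern only).
Tempfile.create('misspellings_regex') do |file|
  file.write("\\bab(?:andonned|ilty|otu)\\b\n")
  file.flush

  regex = load_pattern(file.path)
  puts regex.match?('They Abandonned it') # true
  puts regex.match?('They abandoned it')  # false
end
```

Compiling the pattern once at startup and reusing the resulting Regexp object avoids paying the compilation cost on every check.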

Some more examples of words and the generated output:

'how now brown cow' => /\b(?:[chn]ow|brown)\b/
'the rain in spain stays mainly on the plain' => /\b(?:(?:(?:(?:pl|r)a)?i|o)n|s(?:pain|tays)|mainly|the)\b/
'jackdaws love my giant sphinx of quartz' => /\b(?:jackdaws|quartz|sphinx|giant|love|my|of)\b/
'fu foo bar foobar' => /\b(?:f(?:oo(?:bar)?|u)|bar)\b/
'ms miss misses missouri mississippi' => /\bm(?:iss(?:(?:issipp|our)i|es)?|s)\b/

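Ruby's Regexp.union is nowhere near as sophisticated as Regexp::Assemble. The captured list of misspelled words contains 4225 words, totaling 41,817 characters. Running Perl's Regexp::Assemble against that list generated a regex of 30,954 characters. I'd say that's efficient.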
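For comparison, here is what Regexp.union produces for one of the word lists above. It is a small illustration using only Ruby's core library: the words are joined into a flat alternation, with no merging of shared prefixes.

```ruby
words = %w[ms miss misses missouri mississippi]

# Regexp.union joins the alternatives with "|" and does no factoring.
union = Regexp.union(words)
puts union.source
# ms|miss|misses|missouri|mississippi
```

The assembled form shown earlier, m(?:iss(?:(?:issipp|our)i|es)?|s), factors the shared "m" and "miss" prefixes out of the alternation instead of repeating them in every branch.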

Answer 3

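Try it the other way around. Rather than correcting misspellings and checking the result for duplicates, reduce everything to a sound-alike format (such as Metaphone or Soundex) and check for duplicates in that format.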

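Now, I don't know which way is likely to be faster. On the one hand, you have hundreds of regexes, each of which will fail to match almost immediately and return. On the other, you have 30-odd potential regex replacements, one or two of which will definitely match for every word.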

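Metaphone is pretty fast, since there really isn't much to the algorithm, so I can only suggest that you try it and measure whether it is fast enough for your use.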
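To give a feel for the sound-alike approach, here is a rough sketch that groups words by a hand-rolled, simplified Soundex code. It is an assumption-laden illustration rather than a reference implementation: it uses Soundex instead of Metaphone because it fits in a few lines, and `SOUNDEX_CODES`, `soundex` and the sample word list are hypothetical names and data.

```ruby
# Consonant classes used by Soundex; vowels, h, w and y have no code.
SOUNDEX_CODES = {
  'b' => '1', 'f' => '1', 'p' => '1', 'v' => '1',
  'c' => '2', 'g' => '2', 'j' => '2', 'k' => '2',
  'q' => '2', 's' => '2', 'x' => '2', 'z' => '2',
  'd' => '3', 't' => '3',
  'l' => '4',
  'm' => '5', 'n' => '5',
  'r' => '6'
}.freeze

# Simplified Soundex: first letter plus up to three digits, zero-padded.
def soundex(word)
  letters = word.downcase.delete('^a-z')
  return '' if letters.empty?

  digits = []
  previous = SOUNDEX_CODES[letters[0]]
  letters[1..-1].each_char do |char|
    code = SOUNDEX_CODES[char]
    if code
      # Adjacent letters with the same code count once.
      digits << code unless code == previous
      previous = code
    elsif char != 'h' && char != 'w'
      # Vowels (and y) separate repeated codes; h and w do not.
      previous = nil
    end
  end
  (letters[0].upcase + digits.join)[0, 4].ljust(4, '0')
end

words = %w[abandoned abandonned ability abilty about]
groups = words.group_by { |word| soundex(word) }
p groups
# {"A153"=>["abandoned", "abandonned"], "A143"=>["ability", "abilty"], "A130"=>["about"]}

# Groups with more than one member are candidate duplicates.
p groups.values.select { |group| group.size > 1 }
# [["abandoned", "abandonned"], ["ability", "abilty"]]
```

Sound-alike codes are coarse, so unrelated words can land in the same group; the groups are best treated as candidates to confirm with a stricter comparison.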
