Question

#!/usr/local/bin/perl
use strict;
use warnings;

use Text::SpellChecker;

my $text = "coördinator";
my $checker = Text::SpellChecker->new( text => $text );

while ( my $word = $checker->next_word ) {
    print "Bad word is $word\n";
}

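Output: Bad word is rdinator

Desired: Bad word is coördinator

The module breaks if $text contains Unicode. Any idea how to solve this?

I have Aspell 0.50.5 installed, and this module is using it. I think that might be the culprit.

Edit: Since Text::SpellChecker requires either Text::Aspell or Text::Hunspell, I removed Text::Aspell and installed Hunspell and Text::Hunspell, then: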

$ hunspell -d en_US -l < badword.txt
coördinator

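shows the correct result. That means the problem is either in my code or in Text::SpellChecker.

Taking Miller's suggestion into account, I did the following: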

#!/usr/local/bin/perl
use strict;
use warnings;
use Text::SpellChecker;
use utf8;
binmode STDOUT, ":encoding(utf8)";
my $text =  "coördinator";
my $flag = utf8::is_utf8($text);
print "Flag is $flag\n";
print "Text is $text\n";
my $checker = Text::SpellChecker->new(text => $text);
while (my $word = $checker->next_word) {
    print "Bad word is $word\n";
}

Output:

Flag is 1
Text is coördinator
Bad word is rdinator

Does this mean the module cannot handle UTF-8 characters correctly?

Answer 1

This is a bug in Text::SpellChecker: the current version assumes words are ASCII-only.

http://cpansearch.perl.org/src/BDUGGAN/Text-SpellChecker-0.11/lib/Text/SpellChecker.pm

#
# next_word
# 
# Get the next misspelled word. 
# Returns false if there are no more.
#
sub next_word {
    ...
    while ($self->{text} =~ m/([a-zA-Z]+(?:'[a-zA-Z]+)?)/g) {

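IMHO the best solution would be to use a word-splitting regular expression per language/locale, or to leave word splitting to the underlying library. aspell list reports coördinator as a single word.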
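To illustrate the point, here is a minimal sketch that runs the ASCII-only pattern quoted above against a Unicode-aware alternative. The split_words helper and the \p{L}-based pattern are assumptions made for this example, not code taken from Text::SpellChecker, and the sketch assumes the ö is the precomposed character U+00F6 (a decomposed o plus combining diaeresis would also need \p{M}).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                                   # this file is UTF-8 encoded
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper (not part of Text::SpellChecker):
# return every substring of $text matched by $pattern.
sub split_words {
    my ($text, $pattern) = @_;
    my @words;
    while ($text =~ /($pattern)/g) {
        push @words, $1;
    }
    return @words;
}

# Pattern quoted from next_word in version 0.11: ASCII letters only.
my $ascii_word = qr/[a-zA-Z]+(?:'[a-zA-Z]+)?/;

# Assumed alternative: any Unicode letter, same apostrophe handling.
my $unicode_word = qr/\p{L}+(?:'\p{L}+)?/;

# "coördinator" with a precomposed ö (U+00F6).
my $text = "co\x{F6}rdinator";

my @ascii   = split_words($text, $ascii_word);
my @unicode = split_words($text, $unicode_word);

print "ASCII split:   @ascii\n";
print "Unicode split: @unicode\n";
```

The ASCII pattern cuts the text into "co" and "rdinator", which is consistent with the checker reporting only "rdinator" as the bad word; the Unicode-aware pattern keeps the word in one piece so the dictionary can judge it as a whole.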

Answer 2

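I incorporated Chankey's solution and released version 0.12 to CPAN; give it a try.

The validity of the diaeresis in words like coördinator is an interesting question. The default aspell and hunspell dictionaries seem to flag it as incorrect, but some publications may disagree.

Best, Brian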

Text::SpellChecker module and Unicode
