Question

I'm using the latest version of spacy_hunspell with Portuguese dictionaries. I noticed that when I check inflected verbs containing special characters, such as the acute accent (´) and the tilde (~), the spellchecker fails to return the correct validation:

import hunspell

spellchecker = hunspell.HunSpell('/usr/share/hunspell/pt_PT.dic', '/usr/share/hunspell/pt_PT.aff')

# Verb: fazer
spellchecker.spell('fazer') # True, correct
spellchecker.spell('faremos') # True, correct
spellchecker.spell('fará') # False, incorrect
spellchecker.spell('fara') # True, incorrect
spellchecker.spell('farão') # False, incorrect

# Verb: andar
spellchecker.spell('andar') # True, correct
spellchecker.spell('andamos') # True, correct
spellchecker.spell('andará') # False, incorrect
spellchecker.spell('andara') # True, correct

# Verb: ouvir
spellchecker.spell('ouvir') # True, correct
spellchecker.spell('ouço') # False, incorrect

Another problem is with irregular verbs, such as ir:

spellchecker.spell('vamos') # False, incorrect
spellchecker.spell('vai') # False, incorrect
spellchecker.spell('iremos') # True, correct
spellchecker.spell('irá') # False, incorrect

As far as I can tell, this problem does not occur with nouns that contain special characters.

Any suggestions?

Answer 1

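This question is about hunspell, not spacy or spacy_hunspell.

I think this is an encoding problem, even if it doesn't look like one in all of your test cases. I'm not sure where you found these Portuguese dictionaries, but they aren't in UTF-8, and they aren't the current/standard hunspell pt_PT dictionaries from LibreOffice: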

https://github.com/LibreOffice/dictionaries/tree/master/pt_PT

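If you install the package hunspell-pt-pt (for example, with apt-get install hunspell-pt-pt), you get the Portuguese dictionaries that Debian/Ubuntu ship, and they behave correctly for the test cases above, both on the command line and from pyhunspell as in the code above.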
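A rough way to check the encoding theory yourself: a Hunspell .aff file normally names its encoding on a SET line, so you can read that line directly. The sketch below is only an illustration; declared_encoding and same_bytes_in_both are helper names I made up, not part of pyhunspell. The second helper shows why only the accented forms would be affected: a word made of plain ASCII letters has the same bytes in UTF-8 and ISO-8859-1, while a word with á, ã or ç does not.

```python
def declared_encoding(aff_path):
    """Return the encoding named on the SET line of a Hunspell .aff file.

    Returns None when the file has no SET line.
    """
    # Read raw bytes and decode as latin-1 so that no byte sequence can fail.
    with open(aff_path, "rb") as handle:
        for raw_line in handle:
            fields = raw_line.decode("latin-1").split()
            if len(fields) >= 2 and fields[0] == "SET":
                return fields[1]
    return None


def same_bytes_in_both(word, first="utf-8", second="iso-8859-1"):
    """True when the word encodes to identical bytes under both encodings."""
    return word.encode(first) == word.encode(second)


if __name__ == "__main__":
    # Words taken from the question; only the unaccented one is unaffected.
    for word in ["faremos", "fará", "farão", "ouço"]:
        print(word, same_bytes_in_both(word))
```

To use it on the files from the question, call declared_encoding('/usr/share/hunspell/pt_PT.aff') and compare the result with the encoding of the strings you pass to spell().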

Answer 2

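To clarify some important ideas: spellchecking and lemmatization are usually done with a set of predefined rules (yes, no machine learning, and no extensive annotated thesaurus either). However, as you've noticed, some of those rules don't apply to irregular verbs and inflections.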
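To make the point about rules concrete, here is a toy sketch of suffix rules for the Portuguese simple future. It is not taken from pt_PT.aff or from spaCy; the rule table, the exception table and the function name future_forms are all made up for illustration. Regular verbs just append the ending to the infinitive, while an irregular verb such as fazer only comes out right if its stem is listed explicitly.

```python
# Toy suffix rules for the simple future tense (illustrative only):
# regular verbs append the ending to the infinitive.
FUTURE_ENDINGS = {"3sg": "á", "1pl": "emos", "3pl": "ão"}

# Irregular stems must be listed by hand; the rule alone cannot derive them.
IRREGULAR_FUTURE_STEMS = {"fazer": "far"}


def future_forms(infinitive, use_exceptions=True):
    """Build future-tense forms from the toy rules above."""
    stem = infinitive
    if use_exceptions:
        stem = IRREGULAR_FUTURE_STEMS.get(infinitive, infinitive)
    return {person: stem + ending for person, ending in FUTURE_ENDINGS.items()}


if __name__ == "__main__":
    print(future_forms("andar"))                        # regular verb
    print(future_forms("fazer", use_exceptions=False))  # rule alone: wrong forms
    print(future_forms("fazer"))                        # with the exception entry
```

Without the exception entry the rule produces forms like "fazerá", which is the kind of gap a rule set has to cover verb by verb.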

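As it turns out, the spaCy models and rules for Portuguese (and really not only spaCy's, but those of any tool for Portuguese) are very weak compared to other languages.

Conclusion: you aren't getting wrong results because you made a mistake, but because the models provided by spacy (and hunspell) are wrong.

Since these are open-source projects, you can try to improve them yourself. Otherwise, you could try other tools such as dicio (thesaurus-based, but slow, because you have to integrate with it via Ajax and every word needs its own request!)

Welcome to Portuguese NLP!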

Hunspell for Portuguese shows correctly spelled words as misspelled
