Spell checking with Hunspell and Python fails on Portuguese words with special characters

Time: 2018-07-04 14:10:25

Tags: python-3.x nlp spacy hunspell

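I am trying to correct spelling errors, and for that I am using spaCy together with Hunspell in Python. I wrote the code below to get suggestions for "cardaço", which is a misspelling of the Portuguese word "cadarço".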

import hunspell
from spacy.tokens import Token
import spacy

class spaCyHunSpell(object):
    name = 'spacy_hunspell'

    def __init__(self, dic_path, aff_path):
        self.hobj = hunspell.HunSpell(dic_path, aff_path)
        Token.set_extension('hunspell_spell', default=None)
        Token.set_extension('hunspell_suggest', getter=self.get_suggestion)

    def __call__(self, doc):
        for token in doc:
            token._.hunspell_spell = self.hobj.spell(token.text)
        return doc

    def get_suggestion(self, token):
        return self.hobj.suggest(token.text)

nlp = spacy.load('pt')
# Use a separate name so the imported hunspell module is not shadowed.
spell_checker = spaCyHunSpell('/usr/share/hunspell/pt_BR.dic', '/usr/share/hunspell/pt_BR.aff')
nlp.add_pipe(spell_checker)
doc = nlp(u'cardaço')
print(doc[0]._.hunspell_suggest)

I have all the libraries installed correctly, and the code above works fine for the word "feninine". My problem is the "ç".

The error I get is:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 5: invalid continuation byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/netshoes/PycharmProjects/migracao-sas/modelo_NICHO2/main.py", line 29, in <module>
    print(doc[0]._.hunspell_suggest)
  File "/usr/local/lib/python3.6/dist-packages/spacy/tokens/underscore.py", line 31, in __getattr__
    return getter(self._obj)
  File "/home/netshoes/PycharmProjects/migracao-sas/modelo_NICHO2/main.py", line 23, in get_suggestion
    return self.hobj.suggest(token.text)
SystemError: <built-in method suggest of HunSpell object at 0x7f6b3560fe50> returned a result with an error set

I tried using unidecode, without success.

My Python version is 3.6.

2 answers:

Answer 0: (score: 0)

Try putting this on the first line of your file:

# -*- coding: utf-8 -*-

Answer 1: (score: 0)

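A note in case anyone else gets stuck on this. There is a package called spacy_hunspell, which is a wrapper around Hunspell for Python and spaCy. It depends on version 0.5.0 of the Python hunspell package, which has encoding problems like the one described in this thread.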
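The traceback in the question is consistent with that kind of encoding mismatch. As a minimal sketch, assuming the pt_BR dictionary stores its words as ISO-8859-1 bytes (an assumption, the thread does not confirm it), the same decoding failure can be reproduced in plain Python without Hunspell. In ISO-8859-1, "ç" is the single byte 0xe7, which is not valid UTF-8 on its own.

```python
# Assumption: the dictionary hands back suggestions as ISO-8859-1 bytes.
raw_suggestion = "cadarço".encode("iso-8859-1")

# Decoding those bytes as UTF-8 fails at the byte that stands for "ç".
try:
    raw_suggestion.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# Decoding with the encoding that produced the bytes recovers the word.
recovered = raw_suggestion.decode("iso-8859-1")

if decode_error is not None:
    print("failing byte:", hex(raw_suggestion[decode_error.start]))
    print("position:", decode_error.start)
print("round trip ok:", recovered == "cadarço")
```

In "cadarço" the "ç" sits at index 5, which matches the "position 5" reported in the error message from the question.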

To fix it, change the hunspell requirement in spacy_hunspell's setup.py to hunspell==0.5.5.
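To see which encoding a dictionary declares, one option is to read the SET line of its .aff file. The helper below is only a sketch: read_aff_encoding is a hypothetical name, not part of hunspell or spacy_hunspell, and the example runs against a small stand-in file so that it is self-contained. On a real system the path would be the /usr/share/hunspell/pt_BR.aff file from the question.

```python
import os
import tempfile


def read_aff_encoding(aff_path):
    """Return the encoding named on the SET line of a Hunspell .aff file, or None."""
    # Read raw bytes so the check works whatever encoding the file itself uses.
    with open(aff_path, "rb") as aff_file:
        for raw_line in aff_file:
            # Drop a UTF-8 byte order mark if one precedes the first line.
            parts = raw_line.lstrip(b"\xef\xbb\xbf").split()
            if len(parts) >= 2 and parts[0] == b"SET":
                return parts[1].decode("ascii")
    return None


# Stand-in .aff file for demonstration; its contents are made up.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_aff = os.path.join(tmp_dir, "demo.aff")
    with open(demo_aff, "wb") as handle:
        handle.write(b"SET ISO8859-1\nTRY aeiou\n")
    demo_encoding = read_aff_encoding(demo_aff)

print(demo_encoding)
```

If the declared encoding is not UTF-8, a binding that decodes suggestions as UTF-8 would be expected to fail on words containing characters such as "ç".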