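I am trying to correct spelling errors, and for that I am using spaCy together with Hunspell in Python. I wrote the following code to get suggestions for "cardaço", which is a misspelling of "cadarço" in Portuguese.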
import hunspell
from spacy.tokens import Token
import spacy


class spaCyHunSpell(object):
    name = 'spacy_hunspell'

    def __init__(self, dic_path, aff_path):
        # Load the Hunspell dictionary and affix files.
        self.hobj = hunspell.HunSpell(dic_path, aff_path)
        # Custom token attributes, read as token._.hunspell_spell / token._.hunspell_suggest.
        Token.set_extension('hunspell_spell', default=None)
        Token.set_extension('hunspell_suggest', getter=self.get_suggestion)

    def __call__(self, doc):
        # Mark every token as correctly spelled or not.
        for token in doc:
            token._.hunspell_spell = self.hobj.spell(token.text)
        return doc

    def get_suggestion(self, token):
        # Suggestions are computed lazily, when the attribute is read.
        return self.hobj.suggest(token.text)


nlp = spacy.load('pt')
hunspell = spaCyHunSpell('/usr/share/hunspell/pt_BR.dic', '/usr/share/hunspell/pt_BR.aff')
nlp.add_pipe(hunspell)
doc = nlp(u'cardaço')
print(doc[0]._.hunspell_suggest)
I have all the libraries installed correctly, and the code above works fine for the word "feninine". My problem is with the "ç".
The error I get is:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 5: invalid continuation byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/netshoes/PycharmProjects/migracao-sas/modelo_NICHO2/main.py", line 29, in <module>
    print(doc[0]._.hunspell_suggest)
  File "/usr/local/lib/python3.6/dist-packages/spacy/tokens/underscore.py", line 31, in __getattr__
    return getter(self._obj)
  File "/home/netshoes/PycharmProjects/migracao-sas/modelo_NICHO2/main.py", line 23, in get_suggestion
    return self.hobj.suggest(token.text)
SystemError: <built-in method suggest of HunSpell object at 0x7f6b3560fe50> returned a result with an error set
I tried using unidecode, without success.
My Python version is 3.6.
Answer 0: (score: 0)
Try putting this code on the first line of your file
Answer 1: (score: 0)
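A heads-up for anyone who gets stuck on this. There is a package called spacy_hunspell, which wraps Hunspell for Python and spaCy. It depends on version 0.5.0 of the hunspell Python package, which has an encoding problem, as mentioned in the thread here.
To fix it, change the hunspell requirement in spacy_hunspell's setup.py to hunspell==0.5.5.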
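For anyone wondering where the message in the question comes from, it can be reproduced in plain Python without Hunspell installed. This is only a sketch under an assumption: that the pt_BR dictionary is stored in ISO-8859-1, so the suggestion "cadarço" comes back as Latin-1 bytes and the binding then tries to decode them as UTF-8. The helper aff_encoding is a hypothetical name, not part of hunspell or spacy_hunspell; it just reads the SET line that Hunspell .aff files use to declare their encoding, so you can check the assumption on your own machine.

```python
def aff_encoding(aff_path):
    """Return the encoding named on the SET line of a Hunspell .aff file.

    Hypothetical helper for illustration. Returns None if no SET line is found.
    """
    # latin-1 maps every byte to a character, so reading never fails here.
    with open(aff_path, encoding="latin-1") as handle:
        for line in handle:
            parts = line.split()
            if len(parts) >= 2 and parts[0] == "SET":
                return parts[1]
    return None


# Assumption: the dictionary returns the suggestion as ISO-8859-1 bytes.
raw = "cadarço".encode("iso-8859-1")  # 'ç' becomes the single byte 0xe7

try:
    raw.decode("utf-8")  # what a binding that always assumes UTF-8 would do
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding with the encoding the bytes were written in works.
recovered = raw.decode("iso-8859-1")

print(error_message)
print(recovered)
```

The byte 0xe7 sits at index 5 of "cadarço", which matches "position 5" in the traceback. Calling aff_encoding('/usr/share/hunspell/pt_BR.aff') shows which encoding your installed dictionary actually declares.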