Question

I am doing some NLP tasks on a corpus of strings from the web, and as you would expect, there are encoding issues. Here are a few examples:

they don’t serve sushi : the apostrophe in don't is not standard ' but \xe2\x80\x99
Delicious food – Wow   : the hyphen before wow is \xe2\x80\x93

So now I read these lines, pass them to NLTK for parsing, and use the parse information to train a CRF model through mallet.

Let's begin with the solution that I see everywhere on Stack Overflow. Here are some experiments:

st = "they don’t serve sushi"

st.encode('utf-8')
Out[2]: 'they don\xc3\xa2\xe2\x82\xac\xe2\x84\xa2t serve sushi'

st.decode('utf-8')
Out[3]: u'they don\u2019t serve sushi'

So these are just trial-and-error attempts to see whether anything works.

I finally took the encoded sentence and passed it to the next step: POS tagging with nltk. The call posTags = nltk.pos_tag(tokens) throws this ugly and well-known exception:

 File "C:\Users\user\workspacePy\_projectname_\CRF\FeatureGen.py", line 95, in getSentenceFeatures
    posTags = nltk.pos_tag(tokens)
  File "C:\Users\user\Anaconda\lib\site-packages\nltk\tag\__init__.py", line 101, in pos_tag
    return tagger.tag(tokens)
  File "C:\Users\user\Anaconda\lib\site-packages\nltk\tag\sequential.py", line 61, in tag
    tags.append(self.tag_one(tokens, i, tags))
  File "C:\Users\user\Anaconda\lib\site-packages\nltk\tag\sequential.py", line 81, in tag_one
    tag = tagger.choose_tag(tokens, index, history)
  File "C:\Users\user\Anaconda\lib\site-packages\nltk\tag\sequential.py", line 634, in choose_tag
    featureset = self.feature_detector(tokens, index, history)
  File "C:\Users\user\Anaconda\lib\site-packages\nltk\tag\sequential.py", line 736, in feature_detector
    'prevtag+word': '%s+%s' % (prevtag, word.lower()),
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

When I try decoding instead, the line where I decode the string fails with UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 42: ordinal not in range(128).

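So my current solution is to remove all the non-ASCII characters. But that changes words completely, which causes a serious loss of data for a model based on unigrams and bigrams (word combinations).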

What is the correct approach?

Answer 1

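In your example, st is a str (a list of bytes). To be that, it was encoded in some form (UTF-8 by the looks of it). Think of it as just a list of bytes: you need to know how it was encoded in order to decode it (though UTF-8 is generally a good first punt).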

>>> st = "they don’t serve sushi"
>>> st
'they don\xe2\x80\x99t serve sushi'
>>> type(st)
<type 'str'>

>>> st.encode('utf8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8: ordinal not in range(128)

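So st.encode is nonsensical here. The string is already encoded (as UTF-8 by the interpreter, by the looks of things). For some mad reason, in Python 2 str.encode will first decode into unicode and then encode back to a str. It chooses to decode as ASCII by default, but your data is encoded as UTF-8. So the error you're seeing is in the decode step of your encode operation! It's looking at the list of bytes e2,80,99 and saying 'hmm, those aren't real ASCII characters'.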
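
To make that hidden decode step visible, here is a minimal sketch that performs it explicitly. It is written with b'' and u'' literals so that it does not depend on Python 2's implicit conversion; the variable names are only illustrative.

```python
# UTF-8 bytes, the same ones shown in the interpreter session above.
raw = b'they don\xe2\x80\x99t serve sushi'

# This is the step Python 2 performs silently inside str.encode():
# decoding with the ASCII default, which cannot handle byte 0xe2.
try:
    raw.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding with the codec the bytes were actually written in succeeds.
text = raw.decode('utf-8')

print(ascii_error)
print(repr(text))
```

The failure is reported at the first non-ASCII byte, which is the same position the traceback above complains about.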

Let's start with unicode data instead (note the u prefix):

>>> st = u"they don’t serve sushi"
>>> st
u'they don\u2019t serve sushi'
>>> type(st)
<type 'unicode'>
>>> st.encode('utf8')
'they don\xe2\x80\x99t serve sushi'

Really, all this is Python 2's fault. Python 3 won't let you get away with these shenanigans of treating unicode and str as the same thing.

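The rule of thumb is: always work with unicode inside your code. Only encode or decode when you are getting data into or out of the system, and generally encode as UTF-8 unless you have some other specific requirement.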
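
One way to apply this rule to files is sketched below, assuming the corpus is stored as UTF-8. The helper names read_lines and write_lines and the file names are hypothetical; io.open is used because it takes an encoding argument and should behave the same way on Python 2.7 and Python 3.

```python
import io
import os
import shutil
import tempfile


def read_lines(path, encoding='utf-8'):
    """Decode once, on the way in: return the file as a list of unicode lines."""
    with io.open(path, 'r', encoding=encoding) as handle:
        return [line.rstrip(u'\n') for line in handle]


def write_lines(path, lines, encoding='utf-8'):
    """Encode once, on the way out."""
    with io.open(path, 'w', encoding=encoding) as handle:
        for line in lines:
            handle.write(line + u'\n')


# Demo with a throwaway file containing the two sentences from the question.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'reviews.txt')
dst = os.path.join(workdir, 'reviews_upper.txt')
with open(src, 'wb') as handle:
    handle.write(b'they don\xe2\x80\x99t serve sushi\n'
                 b'Delicious food \xe2\x80\x93 Wow\n')

lines = read_lines(src)                              # unicode from here on
write_lines(dst, [line.upper() for line in lines])   # bytes only on disk
round_trip = read_lines(dst)
shutil.rmtree(workdir)
```

Everything between read_lines and write_lines sees only unicode text, so tokenizing or tagging in that region never triggers the implicit ASCII codec.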

In Python 2 you can ensure that 'data' in your code is automatically unicode u'data':

from __future__ import unicode_literals

>>> st = "they don’t serve sushi"
>>> st
u'they don\u2019t serve sushi'
>>> type(st)
<type 'unicode'>

Answer 2

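This is not a simple problem with a magic solution. You can read more about it at http://blog.luminoso.com/2012/08/20/fix-unicode-mistakes-with-python/. TL;DR: use Python's unicodedata module to find the categories of characters, and assume that words do not use mixed categories.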
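
As a rough illustration only, and not the linked article's algorithm: the Out[2] value in the question looks like UTF-8 bytes that were misread as Windows-1252 and then encoded again. Under that assumption, the sketch below reverses the misreading and uses unicodedata.category to show how the character categories change. The function names fix_mojibake and non_ascii_categories are hypothetical.

```python
import unicodedata


def non_ascii_categories(text):
    """Unicode general categories of the non-ASCII characters in text."""
    return [unicodedata.category(ch) for ch in text if ord(ch) > 127]


def fix_mojibake(text):
    """Undo one round of 'UTF-8 read as Windows-1252', if that is possible.

    If the round trip fails, the text was probably not damaged in this
    particular way, so it is returned unchanged.
    """
    try:
        return text.encode('cp1252').decode('utf-8')
    except UnicodeError:
        return text


# 'don' + a-circumflex + euro sign + trade mark sign + 't': the garbled form.
garbled = u'they don\xe2\u20ac\u2122t serve sushi'
repaired = fix_mojibake(garbled)

print(non_ascii_categories(garbled))   # a letter, a currency sign, a symbol
print(non_ascii_categories(repaired))  # a single closing quotation mark
```

A word that mixes a letter, a currency sign and a trade mark sign is unlikely to be intentional, which is the kind of signal the category-based heuristic relies on.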

Answer 3

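Related to @Aidan Kane's answer, the trick that always works for me is to decode the string first, so that it is unicode whenever you want to do string manipulation, and then to encode it again when you want to write it out to a file or something.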
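
A minimal in-memory sketch of that decode, manipulate, encode sequence, assuming the input bytes are UTF-8 (the sentence is the second example from the question):

```python
raw = b'Delicious food \xe2\x80\x93 Wow'   # bytes as they arrive from the web

text = raw.decode('utf-8')                 # decode first: bytes -> unicode
tokens = text.lower().split()              # string manipulation on unicode
out = u' '.join(tokens).encode('utf-8')    # encode last: unicode -> bytes

print(len(raw), len(text))                 # byte count vs. character count
print(out)
```

The dash survives as a single token, whereas on the raw bytes it would count as three separate items.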

How do I solve this strange Python encoding issue?
