Python: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal

Asked: 2016-03-23 14:25:37

Tags: python nltk

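I am trying to use NLTK to count word frequencies in a body of text. I read in a text file, convert it to lowercase, remove the punctuation, and tokenize it. I then remove the stop words and count the most common words. However, I get the following warning: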

UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal

Here is my code:

import nltk
import string
from nltk.corpus import stopwords
from collections import Counter

def get_tokens():
    with open('/Users/user/Code/abstract/data/Training(3500)/3500_Response_Tweets.txt', 'r') as r_tweets:
        text = r_tweets.read()
        lowers = text.lower()
        # remove the punctuation using the character deletion step of translate
        no_punctuation = lowers.translate(None, string.punctuation)
        tokens = nltk.word_tokenize(no_punctuation)
        return tokens

tokens = get_tokens()
filtered = [w for w in tokens if not w in stopwords.words('english')]
count = Counter(filtered)
print count.most_common(100)

Along with the warning, my output looks like this:

[('so', 268), ('\xe2\x80\x8e\xe2\x80\x8fi', 231), ('like', 192), ('know', 157), ('dont', 137), ('get', 125), ('im', 122), ('would', 118), ('\xe2\x80\x8e\xe2\x80\x8fbut', 118), ('\xe2\x80\x8e\xe2\x80\x8foh', 114), ('right', 113), ('good', 105), ('\xe2\x80\x8e\xe2\x80\x8fyeah', 95), ('sure', 94), ('one', 92),

Traceback when using codecs.open instead:

Traceback (most recent call last):
  File "tfidf.py", line 16, in <module>
    tokens = get_tokens()
  File "tfidf.py", line 12, in get_tokens
    no_punctuation = lowers.translate(None, string.punctuation)
TypeError: translate() takes exactly one argument (2 given)

1 Answer:

Answer 0 (score: 3)

My suggestion: use io.open('filename.txt', 'r', encoding='utf8'). Then you get nice unicode objects instead of ugly byte objects.
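
A minimal sketch of that suggestion, assuming the file is UTF-8 encoded. The file used here is a throwaway temporary file written only for the demonstration (not the asker's data), and `read_text` is a hypothetical helper name.

```python
import io
import os
import tempfile


def read_text(path):
    """Read a UTF-8 encoded file and return its contents as unicode text."""
    with io.open(path, 'r', encoding='utf8') as handle:
        return handle.read()


# Write a small sample file containing a non-ASCII character (demo data only).
fd, sample_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(sample_path, 'w', encoding='utf8') as handle:
    handle.write(u'caf\u00e9 so good')

text = read_text(sample_path)
os.remove(sample_path)

# The result is decoded text rather than raw bytes, so comparing it with
# other unicode strings does not need an implicit conversion.
print(type(text).__name__)
print(text == u'caf\u00e9 so good')
```

Because decoding happens once at the file boundary, every later step (lowercasing, tokenizing, stop word filtering) operates on text of a single type.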

This works in both Python 2 and Python 3. See: https://stackoverflow.com/a/22288895/633961
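
Once the text is unicode, the two-argument `translate(None, string.punctuation)` call from the question no longer applies: the unicode version of `translate` takes a single mapping of code points, which is what the `TypeError` in the traceback reports. The sketch below shows one way the cleanup step could be adapted. The `\xe2\x80\x8e\xe2\x80\x8f` bytes in the question's output are the UTF-8 encoding of U+200E and U+200F (left-to-right and right-to-left marks), so the sketch drops those as well. `clean` and the sample string are made up for illustration, and plain `split()` stands in for `nltk.word_tokenize` to keep the example free of dependencies.

```python
import string
import unicodedata

# Map every ASCII punctuation code point to None so translate() deletes it.
PUNCT_TABLE = {ord(ch): None for ch in string.punctuation}


def clean(text):
    """Lowercase the text, then strip punctuation and invisible format marks."""
    lowered = text.lower().translate(PUNCT_TABLE)
    # Unicode category 'Cf' covers format characters such as U+200E / U+200F.
    return u''.join(ch for ch in lowered if unicodedata.category(ch) != 'Cf')


# Hypothetical sample resembling the tweets in the question.
sample = u"\u200e\u200fI don't know, but... yeah!"
tokens = clean(sample).split()
print(tokens)
```

With the invisible marks removed, a token such as `i` compares equal to the plain stop word `i` instead of being counted as a separate word.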