python UnicodeWarning: Unicode equal comparison. How do I solve this error?

Time: 2015-01-19 11:49:05

Tags: python unicode utf-8

As in a related question, I am running this code:

with open(fin,'r') as inFile, open(fout,'w') as outFile:
  for line in inFile:
     line = line.replace('."</documents', '"').replace('. ', ' ')
     print(' '.join([word for word in line.lower().split() if len(word) >=3 and word not in stopwords.words('english')]), file = outFile)

I get the following error:

UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  print(' '.join([word for word in line.lower().split() if len(word) >=3 and word not in stopwords.words('english')]), file = outFile)

How can I fix this?

1 answer:

Answer 0: (score: 3)

The test word not in stopwords.words('english') performs equality comparisons. Either word, or at least one of the values in stopwords.words('english'), is not a Unicode value.
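To make the mismatch concrete, here is a minimal sketch. It assumes the file is UTF-8 encoded and uses "café" as a made-up sample word; neither detail comes from the question. A byte string holding non-ASCII data does not compare equal to a text string until it is decoded. On Python 2 the first comparison also emits the UnicodeWarning quoted in the question, while Python 3 simply evaluates it to False.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample: the UTF-8 bytes for "café", as Python 2's
# open(fin, 'r') would deliver them, and the same word as Unicode text.
raw_word = b'caf\xc3\xa9'
text_word = u'caf\xe9'

# Comparing the undecoded bytes with the text does not match.
mismatch = (raw_word == text_word)

# Decoding first makes both sides Unicode, so the comparison works.
match = (raw_word.decode('utf8') == text_word)

print(mismatch, match)
```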

Since you are reading from a file, the most likely candidate is word; either decode it, or use a file object that decodes the data as it is read:

print(' '.join([word for word in line.lower().split()
                if len(word) >=3 and
                   word.decode('utf8') not in stopwords.words('english')]),
      file = outFile)

import io

with io.open(fin,'r', encoding='utf8') as inFile,\
        io.open(fout,'w', encoding='utf8') as outFile:

The io.open() function gives you a file object that, in text mode, encodes or decodes as needed.
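A possible end-to-end sketch of the io.open() approach follows. The STOPWORDS set is a hypothetical stand-in for stopwords.words('english') so that the example runs without NLTK, the helper names filter_line and filter_file are invented for illustration, and the sample sentence and temporary files are made up. The replace() calls from the question are left out to keep the focus on decoding.

```python
import io
import os
import tempfile

# Hypothetical stand-in for stopwords.words('english'); the real list
# comes from NLTK. A set also avoids rebuilding the list for every word.
STOPWORDS = frozenset([u'the', u'and', u'for'])


def filter_line(line):
    """Lowercase a line; keep words of 3+ characters that are not stopwords."""
    return u' '.join(word for word in line.lower().split()
                     if len(word) >= 3 and word not in STOPWORDS)


def filter_file(fin, fout):
    """Read fin as UTF-8 text, filter each line, write UTF-8 text to fout."""
    with io.open(fin, 'r', encoding='utf8') as in_file, \
            io.open(fout, 'w', encoding='utf8') as out_file:
        for line in in_file:
            # line is already Unicode text here, so no decode() is needed.
            out_file.write(filter_line(line) + u'\n')


# Demonstration with made-up input in a temporary directory.
tmp_dir = tempfile.mkdtemp()
fin = os.path.join(tmp_dir, 'in.txt')
fout = os.path.join(tmp_dir, 'out.txt')

with io.open(fin, 'w', encoding='utf8') as f:
    f.write(u'The caf\xe9 and the na\xefve cat\n')

filter_file(fin, fout)

with io.open(fout, 'r', encoding='utf8') as f:
    result = f.read()
```

Because both files are opened in text mode with an explicit encoding, every word reaching the comparison is already Unicode, which is the condition the warning complains about.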

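The latter is less error-prone. For example, you test the length of word, but what you are actually testing is the number of bytes. Any word containing characters outside the ASCII code point range takes more than one UTF-8 byte per such character, so len(word) differs from len(word.decode('utf8')).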
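The length difference can be checked with a short sketch. The sample word "où" is an assumption chosen for illustration: it has two characters but three UTF-8 bytes, so it passes a len(word) >= 3 test on bytes and fails the same test on decoded text.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample word "où": 'o' is one UTF-8 byte, 'ù' is two.
text_word = u'o\xf9'
raw_word = text_word.encode('utf8')

char_count = len(text_word)   # number of characters
byte_count = len(raw_word)    # number of UTF-8 bytes

# The same length filter gives different answers for the two forms.
passes_as_text = char_count >= 3
passes_as_bytes = byte_count >= 3

print(char_count, byte_count, passes_as_text, passes_as_bytes)
```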