Question

My code is based on the following code: https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html

I can run my program with a small number of files, but once I get up to a larger number of files, around 1000, I receive this error:

ReadWrite.py:59: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  stopped_tokens = [i for i in tokens if not i in en_stop]

I was wondering whether anyone has run into this issue before, or whether anyone knows how to resolve this error.

Answer 1

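It looks like you are comparing values of different types inside the list comprehension. en_stop contains unicode strings. My guess is that the tokens you read from the files are byte strings in some encoding such as utf-8, cp1251, etc. You should try to determine which encoding your tokens use. You can do that like this: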

encoding = 'utf-8' # assign name like 'utf-8', 'cp1251', etc.
string = tokens[0]
try:
    string.decode(encoding)
    print 'string is {}'.format(encoding)
except UnicodeError:
    print 'string is not {}'.format(encoding)
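If you are not sure which encoding to try, the same check can be looped over a list of candidates. Below is a minimal sketch of that idea; the helper name guess_encoding and the candidate list are my own assumptions, not something from your code. It is written so that it runs on both Python 2 and Python 3.

```python
def guess_encoding(raw, candidates=('utf-8', 'cp1251')):
    """Return the first candidate encoding that decodes `raw` without error.

    `raw` must be a byte string. Returns None if no candidate works.
    The helper name and the default candidates are illustrative assumptions.
    """
    for name in candidates:
        try:
            raw.decode(name)
            return name
        except UnicodeError:
            # This candidate cannot decode the bytes; try the next one.
            continue
    return None


# A Cyrillic word encoded as cp1251; these bytes are not valid utf-8.
sample = u'\u0442\u0435\u043c\u0430'.encode('cp1251')
print(guess_encoding(sample))
```

Keep in mind that a successful decode only means the bytes are valid in that encoding, not that it is the right one. Single-byte encodings such as cp1251 accept almost any input, so put the stricter candidates (like utf-8) first.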

Once you have found the correct encoding, you can get stopped_tokens like this:

stopped_tokens = [i for i in tokens if unicode(i, encoding) not in en_stop]

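unicode(i, encoding) should convert each token to its unicode representation inside the list comprehension, so that both sides of the comparison have the same type.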
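With around 1000 files, it is possible that some tokens are already unicode while others are still byte strings. In that case it may be easier to normalize every token before filtering. Here is a rough sketch, assuming utf-8; the helper to_text and the sample en_stop and tokens values are made up for illustration. It runs on both Python 2 and Python 3.

```python
def to_text(token, encoding='utf-8'):
    """Return `token` as a unicode string.

    Byte strings are decoded with `encoding`; unicode strings pass through.
    The helper name and the utf-8 default are illustrative assumptions.
    """
    if isinstance(token, bytes):  # `bytes` is an alias of `str` on Python 2
        return token.decode(encoding)
    return token


# Stand-ins for the real stop-word list and token list.
en_stop = set([u'the', u'a', u'of'])
tokens = [b'the', b'caf\xc3\xa9', u'of', b'topic']

# Decode first, then compare unicode with unicode.
decoded = [to_text(tok) for tok in tokens]
stopped_tokens = [tok for tok in decoded if tok not in en_stop]
print(stopped_tokens)
```

Since every token is unicode by the time it is compared with en_stop, the mixed-type comparison that triggers the UnicodeWarning should no longer happen.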

Python Latent Dirichlet Allocation Stopped_tokens error
