Question

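I am using NLTK to perform k-means clustering on my text file, in which each line is treated as a document. For example, my text file looks something like this: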

belong finger death punch hasty mike hasty walls jericho
jägermeister rules
rules bands follow performing jägermeister stage
approach

The demo code I am trying to run is: https://gist.github.com/xim/1279283

The error I get is:

Traceback (most recent call last):
  File "cluster_example.py", line 40, in <module>
    words = get_words(job_titles)
  File "cluster_example.py", line 20, in get_words
    words.add(normalize_word(word))
  File "<string>", line 1, in <lambda>
  File "/usr/local/lib/python2.7/dist-packages/nltk/decorators.py", line 183, in memoize
    result = func(*args)
  File "cluster_example.py", line 14, in normalize_word
    return stemmer_func(word.lower())
  File "/usr/local/lib/python2.7/dist-packages/nltk/stem/snowball.py", line 694, in stem
    word = (word.replace(u"\u2019", u"\x27")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)

What is happening here?

Answer 1

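The file is being read as a bunch of str objects, but they should be unicode objects. Python tries to convert them implicitly, using the ASCII codec, and fails on the first non-ASCII byte. Change: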

job_titles = [line.strip() for line in title_file.readlines()]

to explicitly decode each str to unicode (assuming UTF-8 here):

job_titles = [line.decode('utf-8').strip() for line in title_file.readlines()]

It could also be solved by importing the codecs module and using codecs.open rather than the built-in open.
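To make the failure concrete, here is a minimal sketch of what the implicit conversion attempts and why the explicit decode works. The sample string is an assumption made up for illustration; it contains U+2019 (right single quotation mark), whose UTF-8 encoding starts with the byte 0xe2 named in the error.

```python
from __future__ import print_function

# Bytes as they would come from a file opened without an encoding
# (a str in Python 2, bytes in Python 3). Sample text is hypothetical.
raw_line = u"don\u2019t stop".encode("utf-8")  # U+2019 becomes E2 80 99

try:
    raw_line.decode("ascii")  # roughly what the implicit conversion does
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Explicit decode with the encoding the file really uses.
decoded = raw_line.decode("utf-8")

print(ascii_error)
print(decoded == u"don\u2019t stop")
```

The ASCII attempt fails at the first byte above 0x7f, while the UTF-8 decode returns the original text.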

Answer 2

This works fine for me.

f = open(file_path, 'r+', encoding="utf-8")

You can add a third parameter, encoding, to make sure the file is read as 'utf-8'.

Note: this method works fine in Python 3; I have not tried it in Python 2.7.
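A minimal Python 3 sketch of this approach, applied to the question's one-document-per-line file. The temporary file and its contents are assumptions standing in for the real input.

```python
import os
import tempfile

# Create a small UTF-8 file that stands in for the question's text file.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write("j\u00e4germeister rules\napproach\n")
    file_path = tmp.name

# Passing encoding= makes open() return decoded text, not raw bytes.
with open(file_path, "r", encoding="utf-8") as f:
    job_titles = [line.strip() for line in f]

os.remove(file_path)
print(len(job_titles))
```

Each element of job_titles is already a text string, so no later decode step is needed.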

Answer 3

You can also try this:

import sys
reload(sys)
sys.setdefaultencoding('utf8')

Answer 4

For me, the problem was the terminal encoding. Adding UTF-8 to .bashrc solved it:

export LC_CTYPE=en_US.UTF-8

Don't forget to reload .bashrc afterwards:

source ~/.bashrc
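To check what Python actually picked up from the environment after a change like this, a small sketch such as the following may help. It uses only standard-library calls; the printed values depend on the machine, so none are shown here.

```python
from __future__ import print_function

import locale
import sys

# Encoding Python derives from the locale settings (LC_CTYPE / LC_ALL / LANG).
preferred = locale.getpreferredencoding(False)

# Encoding used for file names.
fs_encoding = sys.getfilesystemencoding()

# Encoding of the terminal stream; may be None when output is redirected.
stdout_encoding = getattr(sys.stdout, "encoding", None)

print("preferred:", preferred)
print("filesystem:", fs_encoding)
print("stdout:", stdout_encoding)
```

If these report an ASCII variant such as ANSI_X3.4-1968, the locale export has probably not taken effect in the current shell.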

Answer 5

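I got this error when trying to install a Python package in a Docker container. For me, the issue was that the Docker image did not have a locale configured. Adding the following to the Dockerfile solved the problem for me.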

# Avoid ascii errors when reading files in Python
RUN apt-get install -y locales && locale-gen en_US.UTF-8
ENV LANG='en_US.UTF-8' LANGUAGE='en_US:en' LC_ALL='en_US.UTF-8'

Answer 6

To track down anything related to unicode errors, run the following command:

grep -r -P '[^\x00-\x7f]' /etc/apache2 /etc/letsencrypt /etc/nginx

It found mine in:

/etc/letsencrypt/options-ssl-nginx.conf:        # The following CSP directives don't use default-src as

Using shed, I found the offending sequence. It turned out to be an editor mistake.

00008099:   C2 194 302 11000010
00008100:   A0 160 240 10100000
00008101: d 64 100 144 01100100
00008102: e 65 101 145 01100101
00008103: f 66 102 146 01100110
00008104: a 61 097 141 01100001
00008105: u 75 117 165 01110101
00008106: l 6C 108 154 01101100
00008107: t 74 116 164 01110100
00008108: - 2D 045 055 00101101
00008109: s 73 115 163 01110011
00008110: r 72 114 162 01110010
00008111: c 63 099 143 01100011
00008112:   C2 194 302 11000010
00008113:   A0 160 240 10100000
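The same kind of search can be sketched in Python. The helper name find_non_ascii and the sample bytes are assumptions for illustration; the sample mirrors the dump above, where the pair C2 A0 is the UTF-8 encoding of a non-breaking space (U+00A0) on either side of "default-src".

```python
from __future__ import print_function


def find_non_ascii(data):
    """Return (offset, byte value) pairs for every byte above 0x7F."""
    return [(i, b) for i, b in enumerate(bytearray(data)) if b > 0x7F]


# Hypothetical sample reproducing the byte pattern from the dump.
sample = b"\xc2\xa0default-src\xc2\xa0"
hits = find_non_ascii(sample)

for offset, value in hits:
    print("%08d: %02X" % (offset, value))
```

For a real file, the data could come from open(path, 'rb').read(), which avoids any decoding while searching.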

Answer 7

You can try this before using the job_titles string:

source = unicode(job_titles, 'utf-8')

Answer 8

On Ubuntu 18.04 with Python 3.6, I solved the problem by doing both of the following:

with open(filename, encoding="utf-8") as lines:

and, if you are running the tool from the command line:

export LC_ALL=C.UTF-8

Note that if you are on Python 2.7 you have to handle this differently. First, you have to set the default encoding:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Then, to load the file, you must use io.open to set the encoding:

import io
with io.open(filename, 'r', encoding='utf-8') as lines:

You still need to export the environment variable:

export LC_ALL=C.UTF-8
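As a sketch of the io.open step, the following should run unchanged on Python 2.7 and Python 3, since io.open accepts an encoding argument on both. The temporary file and its single line are assumptions standing in for the real input.

```python
from __future__ import print_function, unicode_literals

import io
import os
import tempfile

# Create a throwaway UTF-8 file to read back (hypothetical contents).
fd, filename = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(filename, "w", encoding="utf-8") as out:
    out.write("rules bands follow performing j\u00e4germeister stage\n")

# io.open decodes while reading, so each line is already unicode text.
with io.open(filename, "r", encoding="utf-8") as lines:
    documents = [line.strip() for line in lines]

os.remove(filename)
print(len(documents))
```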

Answer 9

For Python 3, the default encoding is "utf-8". In case of any problem, the basic documentation suggests the following steps: https://docs.python.org/2/library/csv.html#csv-examples

Create a function:

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

Then use the function inside the reader, for example:

csv_reader = csv.reader(utf_8_encoder(unicode_csv_data))
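The encoder function above comes from the Python 2 documentation. On Python 3 the csv module works on text directly, so a sketch of the equivalent step might look like this instead; the sample rows are made up for illustration.

```python
import csv
import io

# Hypothetical UTF-8 CSV bytes, as they would be stored on disk.
raw = "name,drink\nmike,j\u00e4germeister\n".encode("utf-8")

# Decode at the stream level; csv.reader then receives text lines.
# newline="" is what the csv documentation recommends for its input.
text_stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline="")
rows = list(csv.reader(text_stream))

print(len(rows))
```

With a real file, open(path, encoding='utf-8', newline='') plays the role of text_stream.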

Answer 10

Just do the following:

Use open(fn, 'rb').read().decode('utf-8') instead of just open(fn).read()

Answer 11

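Python 3.x or higher

Load the file as a byte stream:

def load_body():  # wrapper name is illustrative; the original shows only the body
    body = ''
    for lines in open('website/index.html', 'rb'):
        # Each item is a bytes line; decode it explicitly as UTF-8.
        decodedLine = lines.decode('utf-8')
        body = body + decodedLine.strip()
    return body

Use a global setting:

import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')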
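To see what the TextIOWrapper in the global setting does, here is a small sketch that wraps an in-memory byte buffer instead of sys.stdout.buffer. The buffer and the sample word are assumptions for illustration.

```python
import io

# Stand-in for sys.stdout.buffer: a plain in-memory byte stream.
buffer = io.BytesIO()

# The wrapper accepts text and writes UTF-8 encoded bytes underneath.
writer = io.TextIOWrapper(buffer, encoding="utf-8")
writer.write("j\u00e4germeister")
writer.flush()

encoded = buffer.getvalue()
print(encoded)
```

Text written through the wrapper reaches the underlying stream as UTF-8 bytes regardless of the locale's default encoding.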
