Question

In Python 2.7 on a Mac, I print the file names retrieved with nltk's PlaintextCorpusReader:

infobasecorpus = PlaintextCorpusReader(corpus_root, '.*\.txt')
for fileid in infobasecorpus.fileids():
    print fileid

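and get UnicodeDecodeError: 'ascii', '100316-N1-The \xc2\xa3250bn cost of developing.txt', 14, 15, 'ordinal not in range(128)', because the file name contains a £ sign.

As I understand it, fileid is a unicode string, and before printing I need to encode it to the default encoding, which is ASCII.

If I use print fileid.encode('ascii', 'ignore'), I get the same error.

If I change the default encoding by setting encoding = "utf-8" in site.py (this advice), it works.

Can anyone tell me: (a) why the encode fails, (b) why changing the default encoding works, and (c) what I should do instead if I am doing something wrong here? (For example, this describes setting the default encoding as an "ugly hack" that leads to misused strings and buggy code.)

(Disclaimer: Python newbie, grateful for your patience if this is obvious.)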

=========================================== Update in response to Rob:

Rob, here is the full text of the test code:

import sys
import os
from nltk.corpus import PlaintextCorpusReader

corpus_root = '/Users/richlyon/Documents/Filing/Infobase/'
infobasecorpus = PlaintextCorpusReader(corpus_root, '.*\.txt')

for fileid in infobasecorpus.fileids():
    print type(fileid)             # result <type 'str'>
    fileid = fileid.decode('utf8')
    print type(fileid)             # result <type 'unicode'>
    print fileid.encode('ascii')

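I have set the default encoding back to ascii and run it.

print fileid.encode('ascii') still fails on the file name containing £.

=========================================== Final update, in case this helps anyone else.

I needed to write: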

fileid = fileid.decode('utf8')
print fileid.encode('ascii', 'ignore')

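But text = nltk.Text(infobasecorpus.words(fileid)) chokes if the fileid strings are <type 'unicode'>, which seems to contradict the advice to convert everything to unicode immediately, before any further processing.

But it works now. Thanks all, especially Rob.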

Answer 1

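Check the type of the fileid object. I suspect it is not a unicode object as you suggest. The error being raised (UnicodeDecodeError) comes from the implicit decode that happens before Python encodes the string for output.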
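A minimal sketch of that implicit decode, assuming the file name bytes are UTF-8 (as the \xc2\xa3 in the error message suggests). In Python 2.7, calling .encode() on a byte string first decodes it with the default codec (ASCII). The sketch performs that decode explicitly so the same failure shows up. The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
# Byte string as returned for the file name in the question.
raw = b'100316-N1-The \xc2\xa3250bn cost of developing.txt'

# This is the step Python 2 performs implicitly when .encode() is
# called on a byte string: decode with the default codec (ASCII).
try:
    raw.decode('ascii')
    failed_at = None
except UnicodeDecodeError as exc:
    # Byte \xc2 sits at offset 14, matching "14, 15" in the traceback.
    failed_at = (exc.start, exc.end)

# Decoding with the codec the bytes are actually in succeeds.
text = raw.decode('utf8')

# Only now does encode(..., 'ignore') get the chance to drop the pound sign.
ascii_only = text.encode('ascii', 'ignore')
```

This also suggests why the 'ignore' argument made no difference in the question: the failure happens in the decode step, before the error handler passed to encode is ever consulted.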

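Once the string has been decoded successfully (to unicode), you can print it by explicitly encoding it with a codec your terminal supports. If your terminal supports unicode display, you may not need to encode it before output.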

infobasecorpus = PlaintextCorpusReader(corpus_root, '.*\.txt')
for fileid in infobasecorpus.fileids():
    fileid = fileid.decode('utf8') ## fileid is now a unicode object
    print fileid.encode('utf8')

Replace utf8 with whatever encoding your file system uses (possibly latin1 on Windows, not sure).
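If the terminal's codec is not known in advance, one possible approach is to ask sys.stdout for it. The helper below is a hypothetical sketch, not part of nltk or of the code above. It assumes sys.stdout.encoding may be None (for example when output is piped) and falls back to UTF-8 in that case.

```python
import sys

def encode_for_terminal(text, encoding=None):
    """Encode a unicode string for output (hypothetical helper).

    If no encoding is given, use the one reported by sys.stdout,
    falling back to UTF-8 when it is missing or None.
    """
    if encoding is None:
        encoding = getattr(sys.stdout, 'encoding', None) or 'utf-8'
    # 'replace' substitutes '?' for characters the codec cannot represent
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, 'replace')
```

In the Python 2.7 loop above, this could be used as print encode_for_terminal(fileid.decode('utf8')).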

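Edit: Overriding the site-wide default encoding is considered a hack because a) it can hide programming problems, which may mean your code is not portable across Python installations, and b) it may affect other code run from the same Python installation. Also, being explicit about encoding and decoding strings will make your life easier when you come back to the code later; you won't have to remember that you modified site.py.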
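As a rough illustration of being explicit, a small helper can decode byte strings at the point where they enter the program, so nothing relies on the default encoding. The name to_unicode and the UTF-8 default are assumptions made for this sketch.

```python
def to_unicode(value, encoding='utf-8'):
    """Return value as a unicode string (hypothetical helper).

    Byte strings are decoded with the given encoding; values that are
    already unicode are returned unchanged.
    """
    if isinstance(value, bytes):  # bytes is an alias of str in Python 2.6+
        return value.decode(encoding)
    return value
```

Because the codec is named at the call site, the code behaves the same on any Python installation, whatever site.py says.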

Understanding unicode in Python and UnicodeDecodeError

1 answer: