Question

I have a file that chardet thinks is probably windows-1252:

$ chardetect pub5.xml
pub5.xml: windows-1252 with confidence 0.73

When I try to read it in Python and then write it out to a CSV file, I get an error on this line:

str = row[r].decode('windows-1252').encode('utf8')

The error I get is:

Traceback (most recent call last):
  File "main.py", line 10, in <module>
    csv_filename='output/studies.csv'
  File "parser.py", line 15, in parse_data_to_csv
    self._write_csv_file(csv_header, csv_filename, xml_files)
  File "parser.py", line 114, in _write_csv_file
    str = row[r].decode('windows-1252').encode('utf8')
  File "/Users/me/.virtualenvs/test/lib/python2.7/encodings/cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 6: ordinal not in range(128)

I don't understand why the ascii codec is involved at all when decoding from windows-1252. Can anyone help?

The string that fails is: aa Mixéu 2002. The same code works fine in the Python console:

str = 'aa Mixéu 2002'
str.decode('windows-1252').encode('utf8')
'aa Mix\xc3\x83\xc2\xa9u 2002'

I am using lxml to set the value of row[r]; I don't know whether that is relevant:

studies = root.findall('.//STUDY')
for study_wrapper in studies:
    row = {}
    row['study_name'] = study_wrapper.get('NAME')

Maybe lxml is somehow setting it as ASCII?

Update: figured it out:

try:
  row[r] = row[r].encode('utf-8')
except UnicodeDecodeError:
  row[r] = row[r].decode('ISO-8859-1').encode('utf-8')

It seems that some of the incoming strings are UTF-8 and some are not, all from the same file!

Answer 1

I strongly suspect that isinstance(row[r], unicode) == True.

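That would cause the first decode to break, because decode expects a byte string and returns unicode. If it is handed a unicode object instead, Python 2 first tries to encode it to bytes using the default encoding (usually ascii) so that it has something to decode, and that implicit encode is what raises the UnicodeEncodeError.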
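
The hidden step can be spelled out by hand. This is only a minimal sketch, assuming the value really is Unicode text as suspected above (the sample string is the one from the question); it performs explicitly the ASCII encode that Python 2 does behind the scenes, and is written so that it also runs on Python 3, where that implicit step no longer exists:

```python
# -*- coding: utf-8 -*-
# Assumption: the attribute value is already Unicode text, not bytes.
text = u'aa Mix\xe9u 2002'

failure = None
try:
    # Python 2 effectively does this first when .decode() is called
    # on a unicode object, before the windows-1252 decode even starts.
    text.encode('ascii')
except UnicodeEncodeError as err:
    failure = err
    print(err)
```

The reported position is 6, the index of the é, which matches the traceback in the question.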

Try: row[r].encode("utf-8")
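
If the values may arrive as either text or bytes, checking the type explicitly avoids the implicit conversion altogether. A rough sketch, not a definitive fix: to_utf8 and its source_encoding parameter are hypothetical names, and windows-1252 is only chardet's guess for this file:

```python
# -*- coding: utf-8 -*-
def to_utf8(value, source_encoding='windows-1252'):
    """Return UTF-8 bytes for either text or byte-string input."""
    if isinstance(value, bytes):
        # Raw bytes: decode from the assumed source encoding first.
        value = value.decode(source_encoding)
    # value is now Unicode text, so encoding it is unambiguous.
    return value.encode('utf-8')


# The same result either way, whichever type the parser handed back.
print(to_utf8(u'aa Mix\xe9u 2002') == to_utf8(b'aa Mix\xe9u 2002'))
```

On Python 2, bytes is an alias for str, so the same check separates str from unicode there as well.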

Getting "UnicodeEncodeError: 'ascii' codec can't encode" when decoding from windows-1252?