I am trying to write a string extracted from an XML file into another file (HTML), but when I run the script it gives me this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 124: ordinal not in range(128)
Here is the Python code:
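import os
import xml.etree.ElementTree as et  # assumed; the original import is not shown

f = open('web/tv.html', 'a')
counter = 0
for showname in os.listdir('xml/additional'):
    tree = et.parse('xml/additional/%s/en.xml' % showname)
    root = tree.getroot()
    series = root.find('Series')
    description = series.find('Overview').text
    cell = '\n<tr><td>' + showname + '</td><td>' + description + '</td></tr>'
    f.write(cell)  # this is the call that raises UnicodeEncodeError
# file objects have no append() method, so the closing markup is written with write()
f.write(u'</table></div></body></html>')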
Here is a sample of the XML file:
<Series>
<Overview>From Emmy Award-winner Dan Harmon comes "Community", a smart comedy series about higher education – and lower expectations. The student body at Greendale Community College is made up of high-school losers, newly divorced housewives, and old people who want to keep their minds active. Within these not-so-hallowed halls, Community focuses on a band of misfits, at the center of which is a fast-talkin' lawyer whose degree has been revoked, who form a study group and end up learning a lot more about themselves than they do about their course work.</Overview>
<other>stuff</other>
</Series>
Can someone tell me what I am doing wrong? I find Unicode very confusing.
Answer 0 (score: 2)
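You are mixing Unicode with byte strings; the XML result is a Unicode value, and it includes an en dash character (U+2013). Such a value cannot be written to a plain text file without encoding it first. When Python 2 has to encode implicitly it falls back to the ASCII codec, which cannot represent the en dash, hence the error.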
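To see the failure in isolation, here is a minimal sketch. It assumes that the implicit conversion done by f.write() is equivalent to an explicit .encode('ascii') call, and the sample text is shortened from the question's XML. The names text, failed and bad_char are illustrative only.

```python
# -*- coding: utf-8 -*-
# A Unicode value containing an en dash (U+2013), as in the Overview element.
text = u'higher education \u2013 and lower expectations'

failed = False
bad_char = None
try:
    # Roughly the step Python 2 performs implicitly inside f.write(cell).
    text.encode('ascii')
except UnicodeEncodeError as exc:
    failed = True
    # exc.start and exc.end delimit the part that could not be encoded.
    bad_char = text[exc.start:exc.end]

print(repr(bad_char))
```

The exception carries the offending position, which is the same kind of information as "position 124" in the traceback from the question.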
Encode description as ASCII text with the xmlcharrefreplace error handler:
description = description.encode('ascii', 'xmlcharrefreplace')
This substitutes an HTML numeric character reference for any code point outside ASCII:
>>> description = u'... a smart comedy series about higher education – and lower expectations.'
>>> description.encode('ascii', 'xmlcharrefreplace')
'... a smart comedy series about higher education &#8211; and lower expectations.'
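Applied to the loop in the question, the fix might look like the sketch below. make_row is a hypothetical helper that is not part of the original script, and the sketch assumes showname and description are both text values (not None). It encodes the whole row rather than only description, so a non-ASCII show name would be handled the same way.

```python
# -*- coding: utf-8 -*-
def make_row(showname, description):
    """Build one HTML table row and return it as ASCII-only bytes."""
    cell = u'\n<tr><td>' + showname + u'</td><td>' + description + u'</td></tr>'
    # Code points outside ASCII become numeric character references.
    return cell.encode('ascii', 'xmlcharrefreplace')

row = make_row(u'Community', u'higher education \u2013 and lower expectations')
print(row)
```

In the loop, f.write(make_row(showname, description)) would replace the two lines that build and write cell. Note that on Python 3 the encoded value is a bytes object, so the file would need to be opened in binary mode ('ab') for this to work there.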