Python 2.7, unicode - ordinal not in range

Asked: 2014-03-30 02:00:48

Tags: python xml python-2.7 unicode

I'm trying to write strings extracted from XML files into another file (HTML), but when I run the script it gives me this error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 124: ordinal not in range(128)

This is the Python code:

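import os
# Assumed import: the alias et suggests xml.etree.ElementTree
import xml.etree.ElementTree as et

f = open('web/tv.html', 'a')
counter = 0
for showname in os.listdir('xml/additional'):
    tree = et.parse('xml/additional/%s/en.xml' % showname)
    root = tree.getroot()
    series = root.find('Series')
    # .text is a unicode object when the element holds non-ASCII characters
    description = series.find('Overview').text
    cell = '\n<tr><td>' + showname + '</td><td>' + description + '</td></tr>'
    # The UnicodeEncodeError is raised here: cell is unicode, the file wants bytes
    f.write(cell)
# File objects have no append() method; write() is the call that exists
f.write(u'</table></div></body></html>')
f.close()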

This is a sample of the XML file:

<Series>
  <Overview>From Emmy Award-winner Dan Harmon comes &quot;Community&quot;, a smart comedy series about higher education – and lower expectations. The student body at Greendale Community College is made up of high-school losers, newly divorced housewives, and old people who want to keep their minds active. Within these not-so-hallowed halls, Community focuses on a band of misfits, at the center of which is a fast-talkin' lawyer whose degree has been revoked, who form a study group and end up learning a lot more about themselves than they do about their course work.</Overview>
  <other>stuff</other>
</Series>

Can anyone tell me what I'm doing wrong? I find Unicode very complicated.

1 Answer:

Answer 0: (score: 2)

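You are mixing Unicode with byte strings. The values parsed from the XML are Unicode, and this one contains an en dash character (U+2013). When a Unicode value is written to a file opened with open(), Python 2 first tries to encode it with the default ASCII codec, and that is the step that fails. The result cannot be written to a plain text file without encoding it first.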
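To see the failure in isolation, here is a minimal sketch that makes the implicit encoding step explicit. The shortened text and the variable names are my own for illustration; only the en dash comes from the question's data.

```python
# With the default settings, Python 2's file.write() does the equivalent of
# text.encode('ascii') when it is handed a unicode object.
# Here that hidden step is spelled out.
text = u'higher education \u2013 and lower expectations'

try:
    text.encode('ascii')
    failure = None
except UnicodeEncodeError as exc:
    failure = exc

if failure is not None:
    # The exception records which codec failed, where, and why
    print('%s codec failed at position %d: %s'
          % (failure.encoding, failure.start, failure.reason))
```

The reported position is the index of the en dash within the string, which is why the traceback in the question points at position 124 of the longer cell.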

I'd encode description to ASCII with the xmlcharrefreplace error handler:

description = description.encode('ascii', 'xmlcharrefreplace')

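This replaces every code point outside ASCII with a numeric character reference, which HTML renders as the original character: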

>>> description = u'... a smart comedy series about higher education – and lower expectations.'
>>> description.encode('ascii', 'xmlcharrefreplace')
'... a smart comedy series about higher education &#8211; and lower expectations.'
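Applied to the loop in the question, the same idea might look like the sketch below. The helper name make_row, the Data root element, and the shortened sample text are assumptions of mine, not part of the original script; the sketch builds the whole row as Unicode and encodes once, just before writing.

```python
import xml.etree.ElementTree as et

# Hypothetical stand-in for one en.xml file; the real root element is unknown
SAMPLE = (u'<Data><Series><Overview>higher education \u2013 and lower '
          u'expectations.</Overview></Series></Data>')


def make_row(showname, description):
    """Build one table row as Unicode, then encode it to ASCII bytes."""
    cell = (u'\n<tr><td>' + showname + u'</td><td>'
            + description + u'</td></tr>')
    # Anything outside ASCII becomes a numeric character reference
    return cell.encode('ascii', 'xmlcharrefreplace')


root = et.fromstring(SAMPLE.encode('utf-8'))
description = root.find('Series').find('Overview').text
row = make_row(u'Community', description)
print(row)
```

The result is a byte string, so in Python 2.7 it can go straight to f.write() on the file the question opens with mode 'a'. On Python 3 the file would need to be opened in binary mode ('ab') instead.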