Question

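I'm building an EPG scraper that creates a UTF-8 encoded XML file. Everything works well, except that I'm having trouble encoding all the strings I piece together into a single unicode string that I can load into my file.

My code looks like this: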

starttime = datetime.strptime(' '.join([str(now.year).encode('UTF-8'), str(e[4].encode('UTF-8')), str(e[0].encode('UTF-8'))]), '%Y %a %d %b %I:%M%p').strftime('%Y%m%d%H%M%S')
endtime = datetime.strptime(' '.join([str(now.year).encode('UTF-8'), str(e[4].encode('UTF-8')), str(e[1].encode('UTF-8'))]), '%Y %a %d %b %I:%M%p').strftime('%Y%m%d%H%M%S')

global epg_data

clean_channel = str(channel.encode('UTF-8').replace('&', '&amp;').replace("'", "&apos;").replace('"', '&quot;').replace('<', '&lt;').replace('>', '&gt;'))
clean_e2 = str(e[2].encode('UTF-8').replace('&', '&amp;').replace("'", "&apos;").replace('"', '&quot;').replace('<', '&lt;').replace('>', '&gt;'))
clean_e3 = str(e[3].encode('UTF-8').replace('&', '&amp;').replace("'", "&apos;").replace('"', '&quot;').replace('<', '&lt;').replace('>', '&gt;'))
div_list3 = div_list2.encode('UTF-8').replace('&', '&amp;').replace("'", "&apos;").replace('"', '&quot;').replace('<', '&lt;').replace('>', '&gt;')
e5 = str(e[5].encode('UTF-8').replace('&', '&amp;').replace("'", "&apos;").replace('"', '&quot;').replace('<', '&lt;').replace('>', '&gt;'))

epg_data = ''.join([u'<programme start="',starttime,u' +0100" stop="',endtime,u' +0100" channel="',clean_channel,u'">\n', \
u'<title lang="eng">',e5,u'</title>\n<desc lang="eng">',clean_e2,' ',clean_e3,u'</desc>\n<icon src="',div_list3,u'" />\n', \
u'<country>UK</country>\n</programme>'])

I run into a problem when trying to parse the following (as printed to IDLE):

<programme start="20180514180500 +0100" stop="20180514190000 +0100" channel="BBC Entertainment">
<title lang="eng">Hustle</title>
<desc lang="eng">Hustle Tiger Troubles Season 6 Episode 3/6When a notorious hardman demands Â£500,000 from Albert by the end of the week, the team tries to raise the cash by targeting a playboy in possession of a gold tiger worth a vast amount of money. Emma is sent to persuade the owner to lend the item to a major museum, in the hope the gang can steal it, but an impenetrable vault causes complications. Guest starring former Doctor Who star Colin Baker and Lolita Chakrabarti : 8.2</desc>
<icon src="http://my.tvguide.co.uk/channel_logos/60x35/68.png" />
<country>UK</country>
</programme>

The error generated is as follows:

Traceback (most recent call last):
  File "G:\Python27\Kodi\Sky TV Guide Scraper.py", line 332, in soup_to_text
    u'<country>UK</country>\n</programme>'])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 75: ordinal not in range(128)

I'm a bit lost as to how to fix this, so any help would be greatly appreciated.

Thanks

Answer 1

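Unicode support is rather confusing in Python 2. It's among the top 50 reasons to move to Python 3. Encoding a str or unicode to UTF-8 returns a str object that is indistinguishable from a regular ASCII string. You just have to remember its encoding. str(channel.encode('utf-8')) is somewhat redundant (the result is already a str), so the str part is unnecessary.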

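When you call ''.join([u'<programme start="', etc...]), you have mixed str and unicode objects, so Python tries to promote everything to unicode. It does that by decoding each str with the default ASCII codec, which is where the UnicodeDecodeError on byte 0xc2 comes from. You know that some of those str strings are really UTF-8 encoded strings, but Python doesn't. Python 3 would know and would bark loudly.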
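The implicit promotion can be made visible by doing by hand what Python 2 does behind the scenes. This is a minimal sketch with illustrative sample text, not the scraper's real data. It runs on both Python 2 and Python 3, although Python 3 would refuse to join bytes and text in the first place.

```python
# UTF-8 bytes for a pound sign followed by digits, like the result of .encode('UTF-8').
encoded = u'\u00a3500,000'.encode('utf-8')

# When a Python 2 join() sees a unicode element, it effectively decodes every
# str element with the ASCII codec. Doing the same thing explicitly fails:
try:
    encoded.decode('ascii')
    failed = False
    message = ''
except UnicodeDecodeError as exc:
    failed = True
    message = str(exc)

# Decoding with the encoding that was actually used succeeds.
decoded = encoded.decode('utf-8')
```

The captured message names the same byte, 0xc2, that appears in the traceback from the question.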

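The general rule for Unicode is to convert at the edges: decode when reading things and encode when writing things. If you had skipped the encode('utf-8') stuff and just stuck with unicode in the code snippet you gave, it would have been okay.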
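One way to follow that rule is to let the file object do the conversion. The sketch below assumes a hypothetical temporary file path and a made-up description string; io.open behaves the same way on Python 2 and Python 3.

```python
import io
import os
import tempfile

# Hypothetical output path for illustration; any writable location works.
path = os.path.join(tempfile.mkdtemp(), 'epg.xml')

# Inside the program, keep everything as text (unicode in Python 2, str in Python 3).
description = u'demands \u00a3500,000 from Albert'
line = u'<desc lang="eng">' + description + u'</desc>\n'

# Encode only when writing: io.open turns the text into UTF-8 bytes.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(line)

# Decode only when reading: io.open hands back text again.
with io.open(path, 'r', encoding='utf-8') as f:
    round_tripped = f.read()
```

With this layout there is no need for any .encode() or str() calls in the code that builds the strings.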

Two other things to consider: Python can escape strings for you. cgi.escape is good for older HTML. xml.sax.saxutils.escape is good for XML, XHTML, and HTML5. str.format helps with the readability of string formatting.

Putting it all together...
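A minimal sketch of how those pieces might fit together, assuming the scraped values are already unicode text. The sample values are hypothetical stand-ins for the scraper's variables, and using quoteattr for attribute values is one reasonable choice rather than the only one.

```python
from xml.sax.saxutils import escape, quoteattr

# Stand-ins for the scraper's values (hypothetical sample data), all kept as text.
starttime = u'20180514180500'
endtime = u'20180514190000'
channel = u'BBC Entertainment'
title = u'Hustle'
desc = u'Demands \u00a3500,000 from Albert & the team <urgent>'
icon = u'http://my.tvguide.co.uk/channel_logos/60x35/68.png'

# quoteattr() supplies its own surrounding quotes, so the template omits them
# for the channel and src attributes.
template = (
    u'<programme start="{start} +0100" stop="{stop} +0100" channel={channel}>\n'
    u'<title lang="eng">{title}</title>\n'
    u'<desc lang="eng">{desc}</desc>\n'
    u'<icon src={icon} />\n'
    u'<country>UK</country>\n'
    u'</programme>'
)

epg_data = template.format(
    start=starttime,
    stop=endtime,
    channel=quoteattr(channel),  # escapes and quotes an attribute value
    title=escape(title),         # escapes &, < and > in element text
    desc=escape(desc),
    icon=quoteattr(icon),
)

# Encode once, at the edge, when the data leaves the program.
xml_bytes = epg_data.encode('utf-8')
```

Because every piece is text until the final encode, the join never has to guess an encoding and the pound sign survives intact.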

