Question

I have Python code that tries to read RSS feeds written in Cyrillic (e.g. Russian). This is the code I use:

import feedparser
from urllib2 import Request, urlopen

d=feedparser.parse(source_url)

# Make a loop over the entries of the RSS feed.
for e in d.entries:
    # Get the title of the news.
    title = e.title
    title = title.replace(' ','%20')
    title = title.encode('utf-8')

    # Get the URL of the entry.
    url = e.link
    url = url.encode('utf-8')


    # Make the request. 
    address = 'http://example.org/save_link.php?title=' + title + '&source=' + source_name + '&url=' + url

    # Submit the link.
    req = Request(address)
    f = urlopen(req)

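I use encode('utf-8') because the titles are given in Cyrillic, and it works fine. An example of the RSS feed is here. The problem appears when I try to read the list of RSS feeds from another URL. In more detail, there is a web page that contains a list of RSS feeds (the URLs of the feeds as well as their names, given in Cyrillic). An example of the list is: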

<!DOCTYPE HTML PUBLIC '-//W3C//DTD HTML 4.01 Transitional//EN' 'http://www.w3.org/TR/html4/loose.dtd'>
<html>
<head>
<title></title>
<meta http-equiv='Content-Type' content='text/html;charset=utf-8'>
</head>
<body>
ua, Корреспондент, http://k.img.com.ua/rss/ua/news.xml
ua, Українська Правда, http://www.pravda.com.ua/rss/

</body>
</html>

The problem appears when I try to apply encode('utf-8') to the Cyrillic names given in this document: I get a UnicodeDecodeError. Does anyone know why?

Answer 1

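encode only raises a UnicodeDecodeError if you give it a str object, which Python 2 first tries to decode to unicode (using the default ASCII codec) before encoding; see http://wiki.python.org/moin/UnicodeDecodeError.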
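The hidden decode step can be made visible with a small sketch. It is written for Python 3, which is an assumption (the thread uses Python 2): there, Python 2's str corresponds to bytes and unicode to str, so the implicit ASCII decode has to be spelled out. The helper name py2_style_encode and the sample value are hypothetical, for illustration only.

```python
# Python 3 sketch of what Python 2's str.encode('utf-8') does behind the scenes.
# Python 2 `str` ~ Python 3 `bytes`; Python 2 `unicode` ~ Python 3 `str`.

# UTF-8 bytes as they might arrive from the HTML page (hypothetical sample).
raw_name = 'Корреспондент'.encode('utf-8')


def py2_style_encode(byte_string, encoding='utf-8'):
    """Mimic Python 2: implicitly decode with ASCII, then encode."""
    text = byte_string.decode('ascii')  # the hidden step that fails on Cyrillic
    return text.encode(encoding)


try:
    py2_style_encode(raw_name)
    error_name = None
except UnicodeDecodeError as exc:
    # An *encode* call ends up reporting a *decode* error.
    error_name = type(exc).__name__

print(error_name)
```

Pure ASCII input passes through the same helper without an error, which is why the problem only shows up once non-ASCII names appear.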

You need to decode the str object to unicode first:

name = name.decode('utf-8')

This takes a UTF-8 encoded str and gives you a unicode object.

encode works for the code you posted because feedparser returns the feed data already decoded to unicode.
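Applied to the feed list from the question, the decode-first approach might look like the sketch below. It is written for Python 3 (an assumption; in Python 2 the same calls apply to str and unicode), and it works on an in-memory sample instead of fetching the page. The function name parse_feed_list and the "three comma-separated fields per line" format are assumptions based on the example listing, not something the thread specifies.

```python
def parse_feed_list(page_bytes, encoding='utf-8'):
    """Decode the raw page once, then pull out (country, name, url) rows."""
    text = page_bytes.decode(encoding)  # bytes -> text, done exactly once
    feeds = []
    for line in text.splitlines():
        # Split into at most three fields so commas inside a URL survive.
        parts = [part.strip() for part in line.split(',', 2)]
        # Keep only lines that look like "country, name, url"; skip HTML markup.
        if len(parts) == 3 and parts[2].startswith('http'):
            feeds.append(tuple(parts))
    return feeds


# Hypothetical stand-in for the bytes returned by urlopen(...).read().
sample_page = (
    "<html>\n<body>\n"
    "ua, Корреспондент, http://k.img.com.ua/rss/ua/news.xml\n"
    "ua, Українська Правда, http://www.pravda.com.ua/rss/\n"
    "</body>\n</html>\n"
).encode('utf-8')

for country, name, url in parse_feed_list(sample_page):
    # The name is text now, so encoding it back to UTF-8 bytes is safe.
    encoded_name = name.encode('utf-8')
    print(country, url, len(encoded_name))
```

Decoding at the point where bytes enter the program and encoding only where they leave it keeps every intermediate value in one type, so no implicit ASCII conversion can occur in between.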
