Question

I am trying to scrape an RSS feed that contains news titles like this one:

<title>Photo of iceberg that is believed to have sunk Titanic sold at auction for £21,000 alongside &amp;#039;world&amp;#039;s most valuable biscuit&amp;#039;</title>

This is how I am scraping it with Beautiful Soup:

soup = BeautifulSoup(xml, 'xml')
start = soup.findAll('item')
for i in start:
    news, is_created = News.create_or_update(
        news_id,
        head_line=i.title.text.encode('utf-8').strip(),
        ...)

Despite this, the title still comes out like this:

Photo of iceberg that is believed to have sunk Titanic sold at auction for \xa321,000 alongside &#039;world&#039;s most valuable biscuit&#039;

Would it be easier to convert these special characters to ASCII?

Answer 1

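I believe I have finally found the problem. The characters above are escaped HTML inside XML. What a mess. If you look at the Independent RSS feed, most titles are affected.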

So this is not a UTF-8 problem. The real question is how to unescape any HTML entities in the title above before converting it to UTF-8:

head_line=i.title.text.encode('utf-8').strip(),

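I solved it by unescaping the title with HTMLParser and then encoding it as UTF-8. Marco's answer does essentially the same thing, but the html library did not work for me.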

head_line=HTMLParser.HTMLParser().unescape(i.title.text).encode('utf-8').strip(),

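I do not recommend using from_encoding='latin-1', because it causes other problems. Unescaping followed by encode('utf-8') is sufficient: the \xa3 in the output is £ (U+00A3), which is the correct Unicode character.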
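
For readers on Python 3, the same idea can be sketched with the standard library alone. This is only an illustration under a few assumptions: the sample item string and the helper name clean_title are made up for the example, xml.etree.ElementTree stands in for Beautiful Soup, and html.unescape takes the place of HTMLParser().unescape, which no longer exists in recent Python 3 releases.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical sample item modelled on the title in the question.
# "\u00a3" is the pound sign; the apostrophes are double-escaped (&amp;#039;).
RSS_ITEM = (
    "<item><title>sold at auction for \u00a321,000 alongside "
    "&amp;#039;world&amp;#039;s most valuable biscuit&amp;#039;</title></item>"
)


def clean_title(item_xml):
    """Return the item's title with the leftover HTML entities unescaped."""
    item = ET.fromstring(item_xml)
    # The XML parser has already turned &amp;#039; into the literal text &#039;
    raw_title = item.findtext("title")
    # Second pass: resolve the HTML entity that was hidden inside the XML.
    return html.unescape(raw_title).strip()


title = clean_title(RSS_ITEM)
print(title)                  # text with real apostrophes and a pound sign
print(title.encode("utf-8"))  # bytes, only needed if the storage layer wants bytes
```

In Python 3 the title is already a str after unescaping, so the encode('utf-8') step is only needed when the destination expects bytes.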

Answer 2

For the example you provided, this works for me:

from bs4 import BeautifulSoup
import html

xml='<title>Photo of iceberg that is believed to have sunk Titanic sold at auction for £21,000 alongside &amp;#039;world&amp;#039;s most valuable biscuit&amp;#039;</title>'
soup = BeautifulSoup(xml, 'lxml')
print(html.unescape(soup.get_text()))

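html.unescape handles the HTML entities. If Beautiful Soup does not handle the pound sign (£) correctly, you may need to specify the encoding when creating the BeautifulSoup object, for example: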

soup = BeautifulSoup(xml, "lxml", from_encoding='latin-1')
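
Before relying on an explicit encoding, it may be worth checking what the feed bytes actually are. The sketch below uses only the standard library (no Beautiful Soup call) and assumes, purely for illustration, that the feed is served as UTF-8; the variable names are hypothetical. It shows what happens to the pound sign when UTF-8 bytes are read as Latin-1.

```python
# Assumption for this sketch: the feed bytes are UTF-8 encoded.
utf8_bytes = "\u00a321,000".encode("utf-8")  # pound sign becomes b'\xc2\xa3'

as_utf8 = utf8_bytes.decode("utf-8")      # decoded with the matching codec
as_latin1 = utf8_bytes.decode("latin-1")  # decoded with the wrong codec

print(repr(utf8_bytes))
print(as_utf8)    # the pound sign survives
print(as_latin1)  # an extra character appears in front of the pound sign
```

If the Latin-1 result looks wrong for a given feed, the bytes are probably UTF-8 already and forcing 'latin-1' would introduce a stray character rather than fix one.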

Some characters are not recognized despite UTF-8 encoding
