Question

I'm using BeautifulSoup to extract the news stories (just the titles) from Hacker News, and this is what I have so far:

import urllib2
from BeautifulSoup import BeautifulSoup

HN_url = "http://news.ycombinator.com"

def get_page():
    page_html = urllib2.urlopen(HN_url) 
    return page_html

def get_stories(content):
    soup = BeautifulSoup(content)
    titles_html =[]

    for td in soup.findAll("td", { "class":"title" }):
        titles_html += td.findAll("a")

    return titles_html

print get_stories(get_page())

However, when I run the code, it fails with an error:

Traceback (most recent call last):
  File "terminalHN.py", line 19, in <module>
    print get_stories(get_page())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe2' in position 131: ordinal not in range(128)

How can I get this to work?

Answer 1

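This happens because BeautifulSoup uses unicode strings internally. Printing a unicode string to the console makes Python try to convert it to Python's default encoding, which is usually ascii. For sites with non-ascii content this will generally fail. You can learn the basics of Python and Unicode by googling "python + unicode". In the meantime, convert your unicode strings to utf-8 using

print some_unicode_string.encode('utf-8')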
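As a minimal sketch of why the implicit ascii conversion fails while an explicit UTF-8 encoding succeeds: the title string below is a made-up example (not real Hacker News data), chosen only because it contains the same u'\xe2' character named in the traceback. The snippet avoids print so that it runs unchanged on Python 2.7 and Python 3.

```python
# Hypothetical title containing a non-ascii character (u'\xe2'),
# similar to the one reported in the traceback.
title = u"Caf\xe2 society"

# Encoding to ascii raises UnicodeEncodeError, which is what happens
# implicitly when Python 2 prints unicode to an ascii console.
ascii_error = None
try:
    title.encode("ascii")
except UnicodeEncodeError as exc:
    ascii_error = str(exc)

# Encoding explicitly to UTF-8 succeeds and yields a byte string
# that Python 2's print statement can write without conversion.
utf8_bytes = title.encode("utf-8")
```

On Python 2 the resulting utf8_bytes can be passed straight to print; whether it displays correctly still depends on the terminal being set to UTF-8.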

Answer 2

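One thing to note about your code is that findAll returns a list (in this case a list of BeautifulSoup objects), while you only want the titles. You probably want to use find instead, and rather than printing the list of BeautifulSoup objects, collect just the title strings. The following works fine, for example: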

import urllib2
from BeautifulSoup import BeautifulSoup

HN_url = "http://news.ycombinator.com"

def get_page():
    page_html = urllib2.urlopen(HN_url) 
    return page_html

def get_stories(content):
    soup = BeautifulSoup(content)
    titles = []

    for td in soup.findAll("td", { "class":"title" }):
        a_element = td.find("a")
        if a_element:
            titles.append(a_element.string)

    return titles

print get_stories(get_page())

So now get_stories() returns a list of unicode objects, which print as you would expect.

Answer 3

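It works fine; what fails is the output. Either explicitly encode to the console's character set, or find a different way to run the code (for example, from within IDLE).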
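One way to encode explicitly to the console's character set might look like the sketch below. The helper name encode_for_console and its fallback parameter are illustrative assumptions, not part of any library; sys.stdout.encoding is the standard attribute that reports the console's encoding, and it can be None when output is piped.

```python
import sys

def encode_for_console(text, encoding=None, fallback="utf-8"):
    """Encode unicode text for the console (hypothetical helper).

    If no encoding is given, use the one reported by sys.stdout,
    falling back to `fallback` when it is missing or None.
    Characters the target encoding cannot represent become '?'.
    """
    if encoding is None:
        encoding = getattr(sys.stdout, "encoding", None) or fallback
    return text.encode(encoding, "replace")

# Made-up title for illustration only.
title = u"caf\xe2"
as_ascii = encode_for_console(title, "ascii")
as_utf8 = encode_for_console(title, "utf-8")
```

Using the "replace" error handler trades fidelity for robustness: an ascii-only console shows a question mark in place of the character instead of raising UnicodeEncodeError.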

BeautifulSoup findall with class attribute- unicode encode error
