Question

I'm completely new to BeautifulSoup and Python, and I want to get a plain-text result with no HTML tags or other non-text elements.

For information, I'm working with an HTML5 document.

Here is what I did:

#!/usr/bin/env python

import urllib2
from bs4 import BeautifulSoup

html_content = urllib2.urlopen("http://www.demo.com/index.html")

soup = BeautifulSoup(html_content, "lxml")

# Syntax for Beautiful Soup 4.1.2 - does not work
# title = soup.find_all("h2", class_="boc2")

# Syntax for Beautiful Soup version ??? - works fine
# title = soup.find_all("h2", "boc1")

big_title = [h1.string for h1 in soup.find_all("h1", "headline")]
title = [h2.string for h2 in soup.find_all("h2", "boc1")]
aside_title = [h2.string for h2 in soup.find_all("h2", "boc2")]

print big_title, title, aside_title

raw_input()

I get this:

[u'title in header headline'] [u'title in section boc1'] [u'title in aside boc2']

This is what I would like to get:

title in header headline
title in section boc1
title in aside boc2

Answer 1

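What you are getting are unicode strings. Unicode is the better choice when scraping, but if you want to get rid of the u prefix, convert each string with str() (note that in Python 2, str() raises UnicodeEncodeError if the text contains non-ASCII characters):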

big_title = [str(h1.string) for h1 in soup.find_all("h1", "headline")]
title = [str(h2.string) for h2 in soup.find_all("h2", "boc1")]
aside_title = [str(h2.string) for h2 in soup.find_all("h2", "boc2")]

To print just the text, print the 0th element of each list (each list contains only one element). For example:

print big_title[0]
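
To illustrate why the brackets and quotes appear at all, here is a minimal sketch. The three lists are hypothetical stand-ins for the ones built in the question, and the names as_list and as_text are only for illustration. Printing a list shows the repr of each element; joining the elements gives plain text, one title per line.

```python
# Hypothetical stand-ins for the lists built in the question.
big_title = [u"title in header headline"]
title = [u"title in section boc1"]
aside_title = [u"title in aside boc2"]

# Printing a list displays the repr of every element, so the brackets and
# quotes (and, in Python 2, the u prefix) come from the list, not the text.
as_list = repr(big_title)

# Joining the elements produces plain text with one title per line.
as_text = u"\n".join(big_title + title + aside_title)
print(as_text)
```

This avoids str() entirely, so the text stays unicode and non-ASCII titles should not be a problem.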

Answer 2

OK, I see what you mean.

Try this:

...
print big_title[0], title[0], aside_title[0]

raw_input()
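
If a heading may appear more than once, indexing with [0] only shows the first match. A rough sketch of an alternative, assuming an inline sample page in place of the real URL (SAMPLE_HTML and extract_titles are hypothetical names, and the built-in "html.parser" is used here instead of lxml so nothing extra is needed), is to collect the text with get_text() and print every match on its own line:

```python
from bs4 import BeautifulSoup

# Hypothetical sample page that mirrors the class names used in the question.
SAMPLE_HTML = """
<html><body>
  <header><h1 class="headline">title in header headline</h1></header>
  <section><h2 class="boc1">title in section boc1</h2></section>
  <aside><h2 class="boc2">title in aside boc2</h2></aside>
</body></html>
"""

def extract_titles(html):
    """Return the text of every matching heading, without tags."""
    soup = BeautifulSoup(html, "html.parser")
    targets = [("h1", "headline"), ("h2", "boc1"), ("h2", "boc2")]
    titles = []
    for tag_name, css_class in targets:
        # A string as the second argument is matched against the CSS class.
        for tag in soup.find_all(tag_name, css_class):
            # get_text(strip=True) returns the tag's text with whitespace trimmed.
            titles.append(tag.get_text(strip=True))
    return titles

for line in extract_titles(SAMPLE_HTML):
    print(line)
```

Unlike .string, get_text() also returns text when the heading contains nested tags such as a link or a span.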

Parsing with BeautifulSoup and getting a full-text result
