Parsing with BeautifulSoup and getting a plain-text result

Time: 2014-09-25 02:53:21

Tags: python beautifulsoup

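I am brand new to BeautifulSoup and Python, and I would like to get a full-text result without any HTML tags or other non-text elements.

For information, I am working with an HTML5 document.

Here is what I did: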

#!/usr/bin/env python

import urllib2
from bs4 import BeautifulSoup

html_content = urllib2.urlopen("http://www.demo.com/index.html")

soup = BeautifulSoup(html_content, "lxml")

# Syntax for Beautiful Soup 4.1.2 - DOES NOT WORK
# title = soup.find_all("h2", class_="boc2")

# Syntax for Beautiful Soup VS ??? - WORKS FINE
# title = soup.find_all("h2", "boc1")

big_title = [h1.string for h1 in soup.find_all("h1", "headline")]
title = [h2.string for h2 in soup.find_all("h2", "boc1")]
aside_title = [h2.string for h2 in soup.find_all("h2", "boc2")]

print big_title, title, aside_title

raw_input()

I get this:

[u'title in header headline'] [u'title in section boc1'] [u'title in aside boc2']

I would like to get this:

title in header headline
title in section boc1
title in aside boc2

2 answers:

Answer 0 (score: 2)

What you are getting are unicode strings. Unicode is the better choice when scraping, but if you want to get rid of the u prefix, do this:

big_title = [str(h1.string) for h1 in soup.find_all("h1", "headline")]
title = [str(h2.string) for h2 in soup.find_all("h2", "boc1")]
aside_title = [str(h2.string) for h2 in soup.find_all("h2", "boc2")]

To print just the text, print element 0 of each list (each list contains only one element). For example:

print big_title[0]
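
As a hedged illustration of why the u prefix shows up at all: printing a list displays the repr() of each element, while printing the elements themselves displays only their text. The sketch below uses placeholder lists copied from the output in the question instead of a live page, and it leaves the strings as unicode, because on Python 2 calling str() on text with non-ASCII characters raises UnicodeEncodeError. It is written to run on both Python 2 and Python 3.

```python
# Placeholder lists standing in for the scraped results shown in the question.
big_title = [u"title in header headline"]
title = [u"title in section boc1"]
aside_title = [u"title in aside boc2"]

# Converting (or printing) a whole list uses repr() on each element,
# which is where the brackets, quotes, and Python 2 "u" prefix come from.
as_list = str(big_title)

# Joining the elements produces plain text, one title per line.
lines = "\n".join(big_title + title + aside_title)
print(lines)
```

Joining the lists also keeps working when a list holds more than one match, which indexing with [0] does not cover.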

Answer 1 (score: 1)

OK... I see what you mean.

Try this:

...
print big_title[0], title[0], aside_title[0]

raw_input()
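
An alternative sketch, not taken from either answer: `get_text()` returns the text of a tag even when it contains nested tags, a case where `.string` returns None. The HTML below is a hypothetical stand-in for the page in the question (the real markup is not shown there), the helper name `texts` is made up for illustration, and the built-in "html.parser" is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class names are taken from the question.
html = """
<header><h1 class="headline">title in header headline</h1></header>
<section><h2 class="boc1">title in section boc1</h2></section>
<aside><h2 class="boc2">title in aside <em>boc2</em></h2></aside>
"""

soup = BeautifulSoup(html, "html.parser")


def texts(tag_name, css_class):
    """Return the stripped text of every tag_name element with css_class."""
    # get_text(" ", strip=True) strips each text fragment and joins
    # the fragments with a single space, so nested tags are handled.
    return [tag.get_text(" ", strip=True)
            for tag in soup.find_all(tag_name, css_class)]


# Collect all titles and print one per line, as the question asks.
all_titles = texts("h1", "headline") + texts("h2", "boc1") + texts("h2", "boc2")
output = "\n".join(all_titles)
print(output)
```

Printing each title on its own line matches the output requested in the question, whereas a single print with commas puts all three on one line.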