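I'm brand new to BeautifulSoup and Python, and I want to get the full text of a page without any HTML tags or other non-text elements.
For information, I'm working with an HTML5 document.
Here is what I did: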
#!/usr/bin/env python
import urllib2
from bs4 import BeautifulSoup
html_content = urllib2.urlopen("http://www.demo.com/index.html")
soup = BeautifulSoup(html_content, "lxml")
# Syntax for Beautiful Soup 4.1.2 - does NOT work
# title = soup.find_all("h2", class_="boc2")
# Syntax for Beautiful Soup version ??? - works fine
# title = soup.find_all("h2", "boc1")
big_title = [h1.string for h1 in soup.find_all("h1", "headline")]
title = [h2.string for h2 in soup.find_all("h2", "boc1")]
aside_title = [h2.string for h2 in soup.find_all("h2", "boc2")]
print big_title, title, aside_title
raw_input()
I get this:
[u'title in header headline'] [u'title in section boc1'] [u'title in aside boc2']
I would like to get this:
title in header headline
title in section boc1
title in aside boc2
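Answer 0 (score: 2)
What you are getting are unicode strings. Unicode is the better choice when scraping, but if you want to get rid of the u prefix, convert each string with str(). Note that in Python 2, str() raises UnicodeEncodeError if the text contains non-ASCII characters: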
big_title = [str(h1.string) for h1 in soup.find_all("h1", "headline")]
title = [str(h2.string) for h2 in soup.find_all("h2", "boc1")]
aside_title = [str(h2.string) for h2 in soup.find_all("h2", "boc2")]
To print only the text, print the 0th element of each list (since each list contains only one element). For example:
print big_title[0]
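The u prefix and the brackets appear because printing a list shows the repr of each element, while printing an element shows the text itself. Here is a minimal sketch of that difference, with hard-coded strings standing in for the values the question gets from `h1.string` and `h2.string`. The helper name `format_titles` is hypothetical and not part of Beautiful Soup:

```python
from __future__ import print_function  # lets the same code run on Python 2 and 3

# Hard-coded stand-ins for the lists built with soup.find_all(...) in the question.
big_title = [u"title in header headline"]
title = [u"title in section boc1"]
aside_title = [u"title in aside boc2"]


def format_titles(*lists):
    """Flatten several lists of strings into one newline-separated text block."""
    lines = []
    for items in lists:
        for item in items:
            # .string is None when a tag has more than one child; skip those.
            if item is not None:
                lines.append(item)
    return u"\n".join(lines)


output = format_titles(big_title, title, aside_title)
print(output)  # one title per line, no brackets, quotes, or u prefix
```

Joining the elements also works when a list holds more than one match, where indexing with `[0]` would silently drop the rest.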
Answer 1 (score: 1)
OK... I see what you mean.
Try this:
...
print big_title[0], title[0], aside_title[0]
raw_input()
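If the goal is the text of the whole page rather than three specific headings, one option is to collect every text node. For a `soup` object, Beautiful Soup's own `get_text()` method is the more direct route. The sketch below instead uses only the standard library's `HTMLParser`, so it is an alternative technique and not what the answers above propose. The class name `TextExtractor`, the function `extract_text`, and the sample markup are all assumptions made for illustration:

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = ("script", "style")

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical markup, guessed from the output shown in the question.
sample = """<html><body>
<header><h1 class="headline">title in header headline</h1></header>
<section><h2 class="boc1">title in section boc1</h2></section>
<aside><h2 class="boc2">title in aside boc2</h2></aside>
</body></html>"""

text = extract_text(sample)
print(text)
```

This keeps every text node in document order, so it is less selective than filtering by tag and class as in the question.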