Extracting and formatting site data in Python

Date: 2016-07-07 20:19:37

Tags: python html web-scraping

This is for Python 3.5.x. What I am looking for is a way to find the headline that comes after this HTML code:
<h3 class = "title-link__title"><span class="title=link__text">News Here</span>

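import urllib.request

# download the page; r.read() returns bytes
with urllib.request.urlopen('http://www.bbc.co.uk/news') as r:
    HTML = r.read()
    # turn the bytes into a list of one-character strings
    HTML = list(HTML)
    for i in range(len(HTML)):
        HTML[i] = chr(HTML[i])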

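How can I get it so that only the headline is extracted, since that is all I need? In any case, I will try to help with more details if needed.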

1 answer:

Answer 0 (score: 1)

Extracting information from a web page is called web scraping.

One of the best tools for this job is the BeautifulSoup library.

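from bs4 import BeautifulSoup
import urllib.request

# open the page and read the raw HTML (in Python 3, urlopen lives in urllib.request)
with urllib.request.urlopen('http://www.bbc.co.uk/news') as response:
    r = response.read()
# create the soup, naming the parser explicitly
soup = BeautifulSoup(r, 'html.parser')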

# useful for understanding the layout of the page
# print(soup.prettify())

# create a ResultSet with all h3 tags that have the class 'title-link__title'
a = soup.find_all("h3", {"class": "title-link__title"})

# count occurrences
len(a)
# result = 44 (at the time of writing)

# get the text of the first headline
a[0].text
# result = '\nMay v Leadsom to be next UK PM\n'

# get the text of the second headline
a[1].text
# result = '\nVideo shows US police shooting aftermath\n'
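
If installing BeautifulSoup is not an option, roughly the same extraction can be sketched with the standard library's html.parser module instead. This is a swapped-in alternative, not the approach used above. The names HeadlineParser and extract_headlines and the sample markup are hypothetical, made up for illustration. The class name 'title-link__title' is taken from the question and may no longer match the live page.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text inside every <h3> that carries a given CSS class."""

    def __init__(self, css_class="title-link__title"):
        super().__init__()
        self.css_class = css_class
        self.headlines = []   # finished headline strings
        self._parts = None    # text chunks of the <h3> being read, else None

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            # the class attribute may hold several space-separated names
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._parts is not None:
            # join the chunks and collapse all runs of whitespace
            self.headlines.append(" ".join("".join(self._parts).split()))
            self._parts = None


def extract_headlines(html):
    """Return the headline texts found in an HTML string."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return parser.headlines


# hypothetical sample markup modelled on the snippet in the question
sample = """
<h3 class="title-link__title"><span class="title-link__text">
May v Leadsom to be next UK PM</span></h3>
<h3 class="other">Not a headline</h3>
<h3 class="title-link__title"><span class="title-link__text">
Video shows US police shooting aftermath</span></h3>
"""
print(extract_headlines(sample))
```

To run it against a downloaded page, decode the bytes first, for example extract_headlines(r.decode('utf-8')), assuming the page is served as UTF-8. Unlike .text in the answer above, this sketch strips the surrounding newlines from each headline.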