This is for Python 3.5.x. What I'm looking for is a way to find the headline that follows this piece of HTML:
<h3 class="title-link__title"><span class="title-link__text">News Here</span>
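import urllib.request

# download the page as raw bytes
with urllib.request.urlopen('http://www.bbc.co.uk/news') as r:
    HTML = r.read()
# turn the bytes into a list of single characters
HTML = list(HTML)
for i in range(len(HTML)):
    HTML[i] = chr(HTML[i])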
How can I extract only the headline, since that is all I need? I'm happy to provide more details if that helps.
Answer 0 (score: 1)
Getting information out of a web page is called web scraping.
One of the best tools for this job is the BeautifulSoup library.
from bs4 import BeautifulSoup
import urllib.request
# open the page (in Python 3, urlopen lives in urllib.request)
r = urllib.request.urlopen('http://www.bbc.co.uk/news').read()
# create the soup, naming the parser explicitly to avoid a warning
soup = BeautifulSoup(r, 'html.parser')
# useful for understanding the layout of your page
# print(soup.prettify())
# create a ResultSet with all h3 tags that have the class 'title-link__title'
a = soup.find_all("h3", {"class": "title-link__title"})
# count the occurrences
len(a)
# result = 44 (when this answer was written)
# get the text of the first headline
a[0].text
# result = '\nMay v Leadsom to be next UK PM\n'
# get the text of the second headline
a[1].text
# result = '\nVideo shows US police shooting aftermath\n'
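If you only need the headlines themselves, you can collect the text of every matching tag in one go and strip the surrounding newlines. Below is a minimal sketch of that idea. The function name extract_headlines and the SAMPLE_HTML string are made up for illustration; the sample markup is modelled on the snippet in the question, and the live BBC page may use different class names by now.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on the snippet in the question;
# the real page may be structured differently.
SAMPLE_HTML = """
<div>
  <h3 class="title-link__title">
    <span class="title-link__text">First headline</span>
  </h3>
  <h3 class="title-link__title">
    <span class="title-link__text">Second headline</span>
  </h3>
  <h3 class="other">Not a headline</h3>
</div>
"""


def extract_headlines(html):
    """Return the stripped text of every h3 with class 'title-link__title'."""
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("h3", class_="title-link__title")
    # get_text(strip=True) drops the leading and trailing whitespace
    # that shows up as '\n' in a[0].text
    return [tag.get_text(strip=True) for tag in tags]


headlines = extract_headlines(SAMPLE_HTML)
print(headlines)
```

To run it against the real site, pass the bytes returned by urllib.request.urlopen(...).read() to extract_headlines instead of SAMPLE_HTML.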