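Web scraping - extracting data from a page using Python

Time: 2017-02-07 16:16:53

Tags: python-3.x web-scraping

This is the code I am using. It returns an empty list. I can't figure out what I am doing wrong!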

from urllib.request import urlopen
import re

url = 'http://pubs.acs.org/doi/full/10.1021/jacs.6b10998'  # example of a web page
html = urlopen(url).read().decode('utf-8')  # download and decode the page

cite_year = '<span class="citation_year">(.+?)</span>'  # pattern for the citation year
pattern = re.compile(cite_year)  # compile the pattern
citation_year = re.findall(pattern, html)  # collect all matches

print(citation_year)  # print the result

1 answer:

Answer 0 (score: 0)

Add a header to the request. I use the requests and bs4 libraries:

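import requests
import bs4

# Send a browser-like User-Agent with the request.
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'http://pubs.acs.org/doi/full/10.1021/jacs.6b10998'  # example of a web page
response = requests.get(url, headers=headers)
soup = bs4.BeautifulSoup(response.text, 'lxml')  # the 'lxml' parser must be installed
# Text of the first element with class "citation_year".
year = soup.find(class_="citation_year").text
print(year)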
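
If you would rather stay with the standard library, the same idea (sending a User-Agent header) can probably be applied to the code from the question. The following is only a minimal sketch: the helper names fetch_html and extract_citation_years and the sample markup are made up for illustration, and it assumes the server returns the expected page once a browser-like User-Agent is sent, which I have not verified for this site.

```python
import re
from urllib.request import Request, urlopen

# Same pattern as in the question; it assumes the year sits in a
# <span class="citation_year"> element.
CITE_YEAR = re.compile(r'<span class="citation_year">(.+?)</span>')


def fetch_html(url, user_agent="Mozilla/5.0"):
    """Download a page, sending a browser-like User-Agent header."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request) as response:
        return response.read().decode("utf-8")


def extract_citation_years(html):
    """Return every citation year found in the HTML text."""
    return CITE_YEAR.findall(html)


# Offline check with a made-up snippet (hypothetical markup, not the real page).
sample_html = '<div><span class="citation_year">2017</span></div>'
print(extract_citation_years(sample_html))  # ['2017']
```

For the live page you would call print(extract_citation_years(fetch_html(url))) with the url from the question. Keeping the download separate from the parsing makes it easier to tell whether an empty list comes from the request or from the pattern.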