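I am writing a Python script that searches Google for a term and collects only the PDF links.
I want to grab the "green" search results, the ones marked up with <cite>. They are not links, just text.
This is what I have so far: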
from bs4 import BeautifulSoup
import requests
import re
url = "http://www.google.com/search?q=shakespeare+pdf"
get = requests.get(url).text
soup = BeautifulSoup(get)
pdf = re.compile(r"\.(pdf)")
cite_pdfs = soup.find_all(pdf, class_="_Rm")
print cite_pdfs
However, the list comes back as [], i.e. empty.
Answer 0 (score: 4)
Here is a working implementation. I pass browser-like request headers (hdr) through urllib2.Request to get past HTTP Error 403: Forbidden.
from BeautifulSoup import BeautifulSoup
import urllib2

site = "http://www.google.com/search?q=shakespeare+pdf"

# Browser-like headers, sent to avoid HTTP Error 403: Forbidden.
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

req = urllib2.Request(site, headers=hdr)
try:
    page = urllib2.urlopen(req).read()
    soup = BeautifulSoup(page)
    # Each green result line is a <cite> element with class "_Rm".
    ka = soup.findAll('cite', attrs={'class': '_Rm'})
    for i in ka:
        print i.text
except urllib2.HTTPError, e:
    # Print the body of the error response returned by the server.
    print e.fp.read()
Here is the output:
davidlucking.com/documents/Shakespeare-Complete%20Works.pdf
www.artsvivants.ca/pdf/.../shakespeare_overvie...
www.folgerdigitaltexts.org/PDF/Ham.pdf
sparks.eserver.org/.../shakespeare-tempest.pdf
manybooks.net/.../shakespeetext94shaks12.htm...
www.w3.org/People/maxf/.../hamlet.pdf
www.adweek.com/.../free...shakespeare.../1868...
www.goodreads.com/ebooks/.../1420.Hamlet
calhoun.k12.il.us/teachers/wdeffenbaugh/.../Shakespeare%20Sonnets.pdf
www.freeclassicebooks.com/william_shakespea...
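
The output above still includes entries that are not PDFs (for example the goodreads.com and manybooks.net lines), while the question asked for PDF links only. A minimal sketch of one way to narrow the list, assuming the visible text of each cite has already been collected into a list of strings: keep only entries that end in ".pdf". The names only_pdfs and PDF_SUFFIX are hypothetical helpers introduced here, not part of the answer's code. Entries that Google truncated with "..." cannot be classified from their text alone, so this sketch drops them.

```python
import re

# Matches text that ends in ".pdf", ignoring case and trailing whitespace.
PDF_SUFFIX = re.compile(r"\.pdf\s*$", re.IGNORECASE)


def only_pdfs(cite_texts):
    """Return the stripped entries whose visible text ends in .pdf.

    Hypothetical helper: truncated entries such as "foo/bar..." are
    dropped because their real extension is unknown.
    """
    return [text.strip() for text in cite_texts if PDF_SUFFIX.search(text)]


if __name__ == "__main__":
    # A few lines copied from the output shown above.
    sample = [
        "www.folgerdigitaltexts.org/PDF/Ham.pdf",
        "manybooks.net/.../shakespeetext94shaks12.htm...",
        "www.w3.org/People/maxf/.../hamlet.pdf",
        "www.goodreads.com/ebooks/.../1420.Hamlet",
    ]
    for entry in only_pdfs(sample):
        print(entry)
```

With the answer's code, the input list could be built from the cite elements, for example [i.text for i in ka], and then passed to only_pdfs. The regular expression is anchored at the end of the string so that a path segment like /pdf/ in the middle of a URL does not count as a match.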