I am trying to download all the PDF files from the link below.
First, I tried to extract the URLs of all the PDF links (the links outlined in red in this image).
from bs4 import BeautifulSoup
import urllib2 as ul

# Fetch the search results page and parse the returned HTML
resp = ul.urlopen("https://www.osapublishing.org/search.cfm?q=comsol&meta=1&cj=1&cc=1")
soup = BeautifulSoup(resp, 'lxml')

# Write the href of every anchor tag to url.txt, one per line
with open('url.txt', 'w') as f:
    for link in soup.find_all('a', href=True):
        f.write(str(link['href']) + '\n')
----------------------------------------------------------------
<url.txt>
http://www.osa.org
#
https://www.osapublishing.org
#
#
#
#
/about.cfm
/aop
/ao
/as
/boe
/col
/jdt
/jlt
/jot
/jocn
/josaa
/josab
/josk
/optica
/ome
/oe
/ol
/prj
/jon
/josa
/on
/aop
/ao
/as
/boe
/col
/jdt
/jlt
/jot
/jocn
/josaa
/josab
/josk
/optica
/ome
/oe
/ol
/prj
/jon
/josa
/on
/conferences.cfm
/conferences.cfm
/conferences.cfm?findby=conference
/conference.cfm?meetingid=5
/conference.cfm?meetingid=124
/conference.cfm?meetingid=56
/conference.cfm?meetingid=144&yr=2015
/conference.cfm?meetingid=153&yr=2015
/conference.cfm?meetingid=131&yr=2015
/conference.cfm?meetingid=174&yr=2015
/conference.cfm?meetingid=109&yr=2015
#global-nav
/books/lasers/lasers.cfm
/oida/reports.cfm
http://www.osa-opn.org
/author/author.cfm
/submit/review/peer_review.cfm
/library/
/osadigitalarchive.cfm
/isp.cfm
http://imagebank.osa.org
/spotlight
/china/
#
/user
#
#
#
https://www.osapublishing.org
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
/
#
#
/user
#
#
/about.cfm
/conferences.cfm
/conferences.cfm
/conferences.cfm?findby=conference
/china/
/author/author.cfm
/submit/review/peer_review.cfm
/library/
/books/lasers/lasers.cfm
/oida/reports.cfm
http://www.osa-opn.org
http://imagebank.osa.org
/spotlight/
/china/
/about.cfm
/benefitslog.cfm
/contactus.cfm
#
/privacy.cfm
/termsofuse.cfm
https://account.osa.org/eweb/dynamicpage.aspx?sso=1&site=osac&webcode=loginrequired&url_success=https%3A%2F%2Fwww%2Eosapublishing%2Eorg%2Fsearch%2Ecfm%3Fq%3Dcomsol%26meta%3D1%26cj%3D1%26cc%3D1%26usertoken%3D%7Btoken%7D
https://account.osa.org/eweb/Dynamicpage.aspx?webcode=forgotpassword*Site=osac
/privacy.cfm
http://www.osa.org/en-us/help/
However, it looks like the links I want are not being extracted. How can I get them?
Answer 0 (score: 2)
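None of the PDF links you want to resolve are in the HTML source returned by 'https://www.osapublishing.org/search.cfm?q=comsol&meta=1&cj=1&cc=1'.
The PDF links are loaded by AJAX after the page itself has loaded, so parsing the initial HTML will never find them.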
I guess you need to open the URL with POST and the right parameters/cookies set. For example: "CFID=xxxxxxxx; CFTOKEN=XXXXXXXX; BIGipServerPubsWeb_HTTP=xxxxxxxxx.xxxxx.xxxx; _ga=GAx.x.xxxxxxxxxx.xxxxxxxxxx; _gat=1"
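As a rough sketch of what "POST with cookies" could look like, the snippet below builds such a request without sending it. It is written for Python 3, where urllib2 became urllib.request. Posting to the search URL itself, reusing the query-string values as form fields, and the helper name build_search_request are all assumptions; the real AJAX endpoint and its parameters would have to be read from the browser's network tab.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the POST goes to the same search URL as in the question.
SEARCH_URL = "https://www.osapublishing.org/search.cfm"


def build_search_request(cookie_header, params):
    """Build (but do not send) a POST request that carries the session cookies."""
    body = urlencode(params).encode("ascii")
    # Supplying a body makes urllib treat the request as a POST.
    return Request(SEARCH_URL, data=body, headers={"Cookie": cookie_header})


# Placeholder cookie values, copied from the example string above.
request = build_search_request(
    "CFID=xxxxxxxx; CFTOKEN=XXXXXXXX",
    {"q": "comsol", "meta": 1, "cj": 1, "cc": 1},
)
print(request.get_method())
```

Passing the finished request to urllib.request.urlopen would actually send it; whether the server answers depends on the cookies being valid.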
Your response will be in JSON format. The object will include 'results[0].data.has-pdf = true', which you can use to test whether a PDF exists. The links look like 'fn:doc("/oe/21/22/27371/oe-21-22-27371.xml")/article/front/article-meta/abstract/p', so you need to match them to the PDF files.
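But I guess they may have some IP checks or other security measures in place, so you may not be able to fetch data via POST from an arbitrary origin. Just a guess ;)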
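If the JSON does arrive, filtering it might look like the sketch below. Only has-pdf and the fn:doc("...") link format come from the answer above; the top-level "results" list, the "link" key name, and the sample payload (including its second entry) are made up for illustration, and the function name xml_paths_with_pdf is hypothetical.

```python
import json
import re

# Captures the path inside fn:doc("...") as in the example link from the answer.
DOC_PATH = re.compile(r'fn:doc\("([^"]+)"\)')


def xml_paths_with_pdf(payload, link_key="link"):
    """Return the XML document paths of all results flagged with has-pdf."""
    paths = []
    for result in json.loads(payload).get("results", []):
        data = result.get("data", {})
        if not data.get("has-pdf"):
            continue  # no PDF available for this result
        match = DOC_PATH.search(data.get(link_key, ""))
        if match:
            paths.append(match.group(1))
    return paths


# Made-up sample payload; the real response layout is not known.
sample = json.dumps({"results": [
    {"data": {"has-pdf": True,
              "link": 'fn:doc("/oe/21/22/27371/oe-21-22-27371.xml")'
                      '/article/front/article-meta/abstract/p'}},
    {"data": {"has-pdf": False,
              "link": 'fn:doc("/oe/0/0/0/example.xml")/article'}},
]})
print(xml_paths_with_pdf(sample))
```

The extracted XML paths would still have to be mapped to the actual PDF URLs, which the answer leaves open.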