Question

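I am trying to download all the PDF files linked from the following page.

First, I tried to extract the URLs of all the PDF links (the links outlined in red in this image).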

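from bs4 import BeautifulSoup
import urllib.request as ul  # Python 3 replacement for urllib2

# Fetch the search results page and parse it.
resp = ul.urlopen("https://www.osapublishing.org/search.cfm?q=comsol&meta=1&cj=1&cc=1")
soup = BeautifulSoup(resp, 'lxml')

# Write the href of every anchor tag to url.txt, one per line.
with open('url.txt', 'w') as f:
    for link in soup.find_all('a', href=True):
        f.write(str(link['href']) + '\n')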

----------------------------------------------------------------

<url.txt>
http://www.osa.org
#
https://www.osapublishing.org
#
#
#
#
/about.cfm

/aop
/ao
/as
/boe
/col
/jdt
/jlt
/jot
/jocn
/josaa
/josab
/josk
/optica
/ome
/oe
/ol
/prj
/jon
/josa
/on
/aop
/ao
/as
/boe
/col
/jdt
/jlt
/jot
/jocn
/josaa
/josab
/josk
/optica
/ome
/oe
/ol
/prj
/jon
/josa
/on
/conferences.cfm
/conferences.cfm
/conferences.cfm?findby=conference
/conference.cfm?meetingid=5
/conference.cfm?meetingid=124
/conference.cfm?meetingid=56
/conference.cfm?meetingid=144&yr=2015
/conference.cfm?meetingid=153&yr=2015
/conference.cfm?meetingid=131&yr=2015
/conference.cfm?meetingid=174&yr=2015
/conference.cfm?meetingid=109&yr=2015
#global-nav
/books/lasers/lasers.cfm
/oida/reports.cfm
http://www.osa-opn.org
/author/author.cfm
/submit/review/peer_review.cfm
/library/
/osadigitalarchive.cfm
/isp.cfm
http://imagebank.osa.org
/spotlight
/china/
#
/user
#
#
#
https://www.osapublishing.org
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
/
#
#
/user
#
#
/about.cfm
/conferences.cfm
/conferences.cfm
/conferences.cfm?findby=conference
/china/
/author/author.cfm
/submit/review/peer_review.cfm
/library/
/books/lasers/lasers.cfm
/oida/reports.cfm
http://www.osa-opn.org
http://imagebank.osa.org
/spotlight/
/china/
/about.cfm
/benefitslog.cfm
/contactus.cfm
#
/privacy.cfm
/termsofuse.cfm
https://account.osa.org/eweb/dynamicpage.aspx?sso=1&site=osac&webcode=loginrequired&url_success=https%3A%2F%2Fwww%2Eosapublishing%2Eorg%2Fsearch%2Ecfm%3Fq%3Dcomsol%26meta%3D1%26cj%3D1%26cc%3D1%26usertoken%3D%7Btoken%7D
https://account.osa.org/eweb/Dynamicpage.aspx?webcode=forgotpassword*Site=osac
/privacy.cfm
http://www.osa.org/en-us/help/

However, the links I want to extract do not appear in the output. How can I extract them?

Answer 1

None of the PDF links you are trying to extract are in the HTML source returned by 'https://www.osapublishing.org/search.cfm?q=comsol&meta=1&cj=1&cc=1'.

The PDF links are loaded afterwards via AJAX.
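
One way to confirm this is to scan the hrefs already collected from the static HTML for anything that looks like a PDF link. This is only a minimal sketch: the helper name `find_pdf_like` and the "contains pdf" heuristic are my own assumptions, not something the site documents.

```python
def find_pdf_like(hrefs):
    """Return hrefs containing 'pdf' (case-insensitive); a rough heuristic only."""
    return [h for h in hrefs if "pdf" in h.lower()]

# A few hrefs copied from the url.txt output in the question.
sample = ["http://www.osa.org", "#", "/about.cfm", "/oe", "/privacy.cfm"]
print(find_pdf_like(sample))  # prints [] because none of these look like PDF links
```

Running the same filter over the full url.txt should tell you quickly whether the static page carries any PDF links at all.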

I guess you need to request the URL with POST and with the correct parameters and cookies set. For example: "CFID=xxxxxxxx; CFTOKEN=XXXXXXXX; BIGipServerPubsWeb_HTTP=xxxxxxxxx.xxxxx.xxxx; _ga=GAx.x.xxxxxxxxxx.xxxxxxxxxx; _gat=1"
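
A rough sketch of what building such a request could look like with the standard library. Everything specific here is an assumption: I don't know which URL the AJAX call actually hits or which form fields it expects, so the endpoint and fields below simply reuse the search URL and its query parameters, and the cookie values are placeholders. `build_search_request` is a hypothetical helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_search_request(url, form_fields, cookies):
    """Build (but do not send) a POST request carrying form data and cookies."""
    body = urlencode(form_fields).encode("utf-8")
    cookie_header = "; ".join(f"{k}={v}" for k, v in cookies.items())
    return Request(url, data=body, headers={"Cookie": cookie_header}, method="POST")

# Placeholder values: copy the real endpoint, fields and cookies from the
# browser's network inspector while the search page is loading.
req = build_search_request(
    "https://www.osapublishing.org/search.cfm",   # assumed endpoint
    {"q": "comsol", "meta": 1, "cj": 1, "cc": 1}, # assumed form fields
    {"CFID": "xxxxxxxx", "CFTOKEN": "xxxxxxxx"},  # placeholder cookies
)
print(req.get_method(), req.get_header("Cookie"))
```

To actually send it you would pass `req` to `urllib.request.urlopen` and read the response body.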

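The response will be in JSON format. Each object in the result list will include a flag such as 'data.has-pdf = true', which you can test to see whether a PDF exists. The links look like 'fn:doc("/oe/21/22/27371/oe-21-22-27371.xml")/article/front/article-meta/abstract/p', so you will need to match them to the corresponding PDF files.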
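
Here is a sketch of filtering such a response once it has been decoded (for example with `json.loads`). The key names "results" and "link", the helper name `xml_paths_with_pdf`, and the second sample entry are my assumptions for illustration; only the `has-pdf` flag and the `fn:doc("...")` link shape come from what I saw. It stops at extracting the XML document path, since how that path maps to a PDF URL still has to be worked out.

```python
import re

# Captures the path inside fn:doc("...") in a link string.
DOC_PATH = re.compile(r'fn:doc\("([^"]+)"\)')

def xml_paths_with_pdf(payload):
    """Return document paths for entries whose data has 'has-pdf' set to true."""
    paths = []
    for entry in payload.get("results", []):            # "results" key is assumed
        if not entry.get("data", {}).get("has-pdf"):
            continue
        match = DOC_PATH.search(entry.get("link", ""))  # "link" key is assumed
        if match:
            paths.append(match.group(1))
    return paths

# Hypothetical decoded response, shaped after the description above.
sample = {"results": [
    {"data": {"has-pdf": True},
     "link": 'fn:doc("/oe/21/22/27371/oe-21-22-27371.xml")'
             '/article/front/article-meta/abstract/p'},
    {"data": {"has-pdf": False},
     "link": 'fn:doc("/example/no-pdf.xml")/article'},  # made-up entry
]}
print(xml_paths_with_pdf(sample))  # prints ['/oe/21/22/27371/oe-21-22-27371.xml']
```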

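But I suspect they have some IP checks or other security measures in place, so you may not be able to fetch the data via POST from a foreign origin. Just a guess ;)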
