Question

I am using Python 3.x with BeautifulSoup for crawling.

I want to learn how to scrape a website that relies on JavaScript.

For example:

<link   href=<%="'mystyle.css?version="+ DateTime.Now.ToString("yyyyMMddhhmmss") +"'"%>   rel="stylesheet" type="text/css"/>

Here,

normally, I would expect

<a id="ContentPlaceHolder1_btnDown"
href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$btnDown','')">
<img src="/images/common/icon/icrobat.gif" alt="emememem"></a>

<a href="javascript:fn_FileDownLoad('NewsLetter/Attach/2016/12/KIPF_161111.pdf',
'_KIPF_161111.pdf');">KIPF_161111.pdf</a>

So I used the URL, and then I got the PDF file.
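
For the second kind of link, the file path sits inside the javascript: href as a quoted argument, so it can be pulled out with a regular expression. The sketch below is only illustrative: parse_js_href is a hypothetical helper name, it assumes the arguments are plain quoted strings with no escaped quotes, and how the extracted path maps to a full download URL depends on what the site's fn_FileDownLoad function does, which is not shown here.

```python
import re

# Matches "javascript:name(args)" with an optional trailing semicolon.
_JS_CALL = re.compile(r"^\s*javascript:\s*([\w$.]+)\s*\((.*)\)\s*;?\s*$", re.S)
# Matches one single- or double-quoted argument (escaped quotes not handled).
_QUOTED = re.compile(r"""(['"])(.*?)\1""", re.S)


def parse_js_href(href):
    """Split a javascript: href into (function_name, [string arguments]).

    Returns None when the href is not a simple javascript: function call.
    """
    match = _JS_CALL.match(href)
    if match is None:
        return None
    name, raw_args = match.groups()
    args = [content for _quote, content in _QUOTED.findall(raw_args)]
    return name, args


# The href values below are copied from the links in the question.
download_href = ("javascript:fn_FileDownLoad("
                 "'NewsLetter/Attach/2016/12/KIPF_161111.pdf',\n"
                 "'_KIPF_161111.pdf');")
postback_href = "javascript:__doPostBack('ctl00$ContentPlaceHolder1$btnDown','')"

print(parse_js_href(download_href))
# ('fn_FileDownLoad', ['NewsLetter/Attach/2016/12/KIPF_161111.pdf', '_KIPF_161111.pdf'])
print(parse_js_href(postback_href))
```

The same helper applied to the __doPostBack link yields only a control name and an empty argument, which is why no file path can be read from that href.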

However, in the first snippet (the __doPostBack link), there is nothing like

a href="/alal/blablabla.pdf"

So where is the URL?

I think I need Selenium. So, if I use a ~~~ .click(), will I get the URL of the PDF file I want?

For example:

href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$btnDown','')"

href="javascript:fn_FileDownLoad('NewsLetter/Attach/2016/12/KIPF_161111.pdf',
'_KIPF_161111.pdf');">KIPF_161111.pdf</a>

Right??

I am confused.

Answer 1

I think this is the method you need: get_attribute()

It is used like this:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Note: PhantomJS support was removed in Selenium 4; on newer versions,
# substitute a headless Chrome or Firefox driver here.
driver = webdriver.PhantomJS("your phantomjs path")
driver.get("your target url")

# First locate the block you need with a CSS selector,
# then get its inner HTML.
html = driver.find_element(By.CSS_SELECTOR, '...').get_attribute('innerHTML')

# Alternatively, locate the block by its id attribute.
html = driver.find_element(By.ID, '...').get_attribute('innerHTML')
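
Note that calling get_attribute('href') on the first link in the question would return only the javascript:__doPostBack(...) string, not a file URL. On a standard ASP.NET Web Forms page, __doPostBack fills the hidden __EVENTTARGET and __EVENTARGUMENT fields and submits the page's form, so the file (if any) comes back as the response to that POST. Clicking the link in Selenium triggers this through the browser. As a rough sketch of what the postback carries, assuming the page really follows that pattern, the form data could be assembled as below; build_postback_payload, HiddenInputCollector, and the sample HTML are hypothetical and not taken from the site in the question.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> fields."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback_payload(page_html, event_target, event_argument=""):
    """Return the form fields a __doPostBack(event_target, event_argument)
    call would submit, based on the hidden inputs found in page_html."""
    collector = HiddenInputCollector()
    collector.feed(page_html)
    payload = dict(collector.fields)
    payload["__EVENTTARGET"] = event_target
    payload["__EVENTARGUMENT"] = event_argument
    return payload


# Hypothetical page fragment; a real page would carry much longer values.
sample_html = """
<form method="post" action="./list.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="search" value="ignored" />
</form>
"""

payload = build_postback_payload(sample_html, "ctl00$ContentPlaceHolder1$btnDown")
print(urlencode(payload))
```

The resulting payload would then be sent as a POST to the form's action URL within the same session (same cookies) that loaded the page. Whether the server answers with the PDF directly or with another page is site-specific, so this should be checked against the real responses.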
