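How to get HTML with JavaScript-rendered source code using Selenium

Asked: 2014-03-30 02:19:21

Tags: javascript python selenium

I run a query on a web page and get a results URL. If I right-click and view the HTML source, I can see the HTML generated by JavaScript. If I simply use urllib, Python cannot get the JS-generated code, so I looked at some solutions that use Selenium. Here is my code: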

from selenium import webdriver
url = 'http://www.archives.com/member/Default.aspx?_act=VitalSearchResult&lastName=Smith&state=UT&country=US&deathYear=2004&deathYearSpan=10&location=UT&activityID=9b79d578-b2a7-4665-9021-b104999cf031&RecordType=2'
driver = webdriver.PhantomJS(executable_path=r'C:\python27\scripts\phantomjs.exe')
driver.get(url)
print(driver.page_source)

>>> <html><head></head><body></body></html>

Obviously that is not right!

This is the source I need, as shown in the right-click "view source" window (I want the information part):

</script></div><div class="searchColRight"><div id="topActions" class="clearfix 
noPrint"><div id="breadcrumbs" class="left"><a title="Results Summary"
href="Default.aspx?    _act=VitalSearchR ...... <<INFORMATION I NEED>> ... 
to view the entire record.</p></div><script xmlns:msxsl="urn:schemas-microsoft-com:xslt">

        jQuery(document).ready(function() {
            jQuery(".ancestry-information-tooltip").actooltip({
href: "#AncestryInformationTooltip", orientation: "bottomleft"});
        });

So my question is: how do I get the information generated by the JavaScript?

6 answers:

Answer 0: (score: 28)

You need to get the document through JavaScript. You can use Selenium's execute_script function:

from time import sleep  # this should go at the top of the file

sleep(5)  # give the page's JavaScript time to finish rendering
html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
print(html)

This returns everything inside the <html> tag.

Answer 1: (score: 8)

There is no need for that workaround; you can use this instead:

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('http://www.google.com/')
html = driver.find_element_by_tag_name('html').get_attribute('innerHTML')

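Answer 2: (score: 1)

I think you are getting the source before the JavaScript has rendered the dynamic HTML.

As a first test, try sleeping for a few seconds between navigating to the page and reading the page source.

If that works, you can then switch to a different wait strategy.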
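
One such wait strategy is to poll until the expected content shows up instead of sleeping for a fixed time. Selenium ships its own helper for this (WebDriverWait in selenium.webdriver.support.ui); the sketch below is not that helper but a plain standard-library loop, so it can be read and run without a browser. The function name wait_for_source and the is_ready predicate are made up for illustration, and the only thing assumed about the driver is that it exposes page_source.

```python
import time


def wait_for_source(driver, is_ready, timeout=10.0, poll_interval=0.5):
    """Poll driver.page_source until is_ready(html) is true or the timeout expires.

    driver   -- any object with a page_source attribute (e.g. a Selenium WebDriver)
    is_ready -- caller-supplied predicate (hypothetical name) that inspects the
                HTML string and returns True once the rendered content is present
    """
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source
        if is_ready(html):
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError("content did not appear within %.1f s" % timeout)
        time.sleep(poll_interval)
```

With the page from the question, a call might look like wait_for_source(driver, lambda html: "searchColRight" in html), since searchColRight is a class that appears in the rendered markup the asker wants.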

Answer 3: (score: 1)

Try [...]. This browser fully supports JavaScript-heavy pages. Give it a try; I hope it works for you.

Answer 4: (score: 0)

I ran into the same problem and finally solved it with desired_capabilities.

from selenium import webdriver
from selenium.webdriver.common.proxy import Proxy
from selenium.webdriver.common.proxy import ProxyType

proxy = Proxy(
     {
          'proxyType': ProxyType.MANUAL,
          'httpProxy': 'ip_or_host:port'
     }
)
desired_capabilities = webdriver.DesiredCapabilities.PHANTOMJS.copy()
proxy.add_to_capabilities(desired_capabilities)
driver = webdriver.PhantomJS(desired_capabilities=desired_capabilities)
driver.get('test_url')
print(driver.page_source)

Answer 5: (score: 0)

I had the same problem getting JavaScript-rendered source from the Internet, and I solved it by following Victory's suggestion.

*First, execute_script

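from time import sleep
from selenium import webdriver

driver = webdriver.Chrome()
driver.get(urls)  # urls holds the address of the page to load
sleep(1)  # wait one second so the page's JavaScript can run
innerHTML = driver.execute_script("return document.body.innerHTML")
# print(driver.page_source)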

*Second, parse the HTML with BeautifulSoup (you can install BeautifulSoup with pip)

import bs4  # BeautifulSoup

root = bs4.BeautifulSoup(innerHTML, "lxml")  # parse the HTML; the "lxml" parser needs the lxml package
viewcount = root.find_all("span", attrs={'class': 'short-view-count style-scope yt-view-count-renderer'})  # find the elements you need

*Third, print the values you need

for span in viewcount:
    print(span.string)
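
If BeautifulSoup is not installed, the same kind of extraction can be approximated with the standard library's html.parser. This is only a rough sketch, not the method used in this answer: the class SpanTextCollector and the sample HTML string are made up for illustration, and it matches a single class token rather than the full class string passed to find_all above.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text of every <span> that carries a given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.texts = []
        self._depth = 0  # greater than 0 while inside a matching span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:
            self._depth += 1  # a span nested inside a matching span
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted_class in classes:
            self._depth = 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.texts[-1] += data


# Made-up sample standing in for the innerHTML string returned by Selenium.
sample_html = (
    '<div><span class="short-view-count style-scope">1,234 views</span>'
    '<span class="other">ignored</span></div>'
)
collector = SpanTextCollector("short-view-count")
collector.feed(sample_html)
print(collector.texts)
```

In real use, the string passed to feed would be the innerHTML value obtained in the first step.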

*Full code

from time import sleep

import bs4
from selenium import webdriver

urls = "http://www.archives.com/member/Default.aspx?_act=VitalSearchResult&lastName=Smith&state=UT&country=US&deathYear=2004&deathYearSpan=10&location=UT&activityID=9b79d578-b2a7-4665-9021-b104999cf031&RecordType=2"

driver = webdriver.PhantomJS()
# driver = webdriver.Chrome()
driver.get(urls)
sleep(1)  # wait one second so the page's JavaScript can run before reading the DOM
innerHTML = driver.execute_script("return document.body.innerHTML")
# print(driver.page_source)

root = bs4.BeautifulSoup(innerHTML, "lxml")  # the "lxml" parser needs the lxml package
viewcount = root.find_all("span", attrs={'class': 'short-view-count style-scope yt-view-count-renderer'})

for span in viewcount:
    print(span.string)

driver.quit()