Selenium is unable to extract the page source and returns a blank body for the HTML page

Time: 2018-08-07 17:50:34

Tags: python selenium selenium-webdriver web-scraping pagesource

This is my Python code:

import pandas as pd
import pandas_datareader.data as web
import bs4 as bs
import urllib.request as ul

from matplotlib import style  # needed for style.use() below
from selenium import webdriver

style.use('ggplot')
driver = webdriver.PhantomJS(executable_path='C:\\Phantomjs\\bin\\phantomjs.exe')

def getBondRate():
    # driver.delete_all_cookies()
    url = "https://www.marketwatch.com/investing/index/tnx?countrycode=xx"

    driver.get(url)
    driver.implicitly_wait(10)
    html = driver.page_source  # full HTML of the loaded page
    return html

bondRate = getBondRate()
print(bondRate)

A few days ago it was reading from MarketWatch just fine. Now it returns nothing inside the body tag. Is Selenium not loading the page?

2 answers:

Answer 0: (score: 0)

Do you still need the HTML tags? If not, you can try retrieving the text through the body tag instead. This is how I do it in Java.

String src=driver.findElement(By.tagName("body")).getText();
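
Since the question is in Python, a rough Python counterpart of the Java line above might look like the sketch below. The helper name get_body_text is hypothetical, and "tag name" is the string value behind Selenium's By.TAG_NAME, written out directly so the snippet needs no import of its own. It returns only the visible text of the body element, so it would presumably still come back empty if the page itself renders blank.

```python
def get_body_text(driver):
    """Return the visible text of the page's <body> element.

    `driver` is assumed to be an already-created Selenium WebDriver
    that has loaded a page with driver.get(...).
    """
    # "tag name" is the locator strategy string that By.TAG_NAME stands for.
    body = driver.find_element("tag name", "body")
    return body.text
```

After driver.get(url), it could be called as print(get_body_text(driver)).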

Answer 1: (score: 0)

Given the URL https://www.marketwatch.com/investing/index/tnx?countrycode=xx, the behaviour you are observing is entirely expected.

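I took your code, made one small tweak, and tried extracting the page_source with both PhantomJS and ChromeDriver. As you can see, whichever WebDriver variant you use, the WebDriver fingerprint is detected and a Fingerprinting error is raised, as shown below: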

  • Error details:

    Failed to load resource: the server responded with a status of 404 (Not Found)
    kpf.js?url=/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint&token=058cbc6a-f8b8-f175-ca68-8c2e0fd6a4e3:1 Fingerprinting error 
      name: Error 
      message: Error issuing AJAX request (status code: 404) 
      stack: Error: Error issuing AJAX request (status code: 404)
        at XMLHttpRequest.N.a.onreadystatechange (https://www.marketwatch.com/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint/script/kpf.js?url=/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint&token=058cbc6a-f8b8-f175-ca68-8c2e0fd6a4e3:1:1884)
    DevTools failed to parse SourceMap: https://www.marketwatch.com/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint/script/fingerprint.js.map
    
  • DevTools snapshot: [image: fingerprinting error]
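
To check for the same error from a script rather than from DevTools, one option is to scan the browser console entries for the message quoted in the error details. The sketch below is only an illustration under stated assumptions: the helper find_fingerprint_errors is hypothetical, and it expects a list of dicts with a "message" key, which is the shape ChromeDriver returns from driver.get_log("browser"). The sample entries are shortened stand-ins modelled on the output above, not captured logs.

```python
def find_fingerprint_errors(entries):
    """Return the console messages that mention a fingerprinting error.

    `entries` is assumed to be a list of dicts, each with a "message" key,
    such as the list returned by driver.get_log("browser") on ChromeDriver.
    """
    marker = "fingerprinting error"
    return [
        entry["message"]
        for entry in entries
        if marker in entry.get("message", "").lower()
    ]


# Shortened sample entries modelled on the console output quoted above.
sample = [
    {"level": "SEVERE",
     "message": "Failed to load resource: the server responded with a status of 404 (Not Found)"},
    {"level": "SEVERE",
     "message": "kpf.js Fingerprinting error name: Error"},
]
print(find_fingerprint_errors(sample))
```

With a live ChromeDriver session, the call would be find_fingerprint_errors(driver.get_log("browser")). A non-empty result would suggest the site's fingerprinting script ran and failed in the automated browser.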