
Question

I am trying to use Selenium and PhantomJS to scrape some elements that are generated by JavaScript.

My code:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.ui import Select

from bs4 import BeautifulSoup
from selenium import webdriver
from collections import OrderedDict
import time

driver = webdriver.PhantomJS()
driver.get('http://www.envirostor.dtsc.ca.gov/public/profile_report?global_id=01290021&starttab=landuserestrictions')

driver.find_element_by_id('sitefacdocsTab').click()
time.sleep(5)

html = driver.page_source
soup = BeautifulSoup(html)

After the click, I still get the old page data instead of the new data supplied by jQuery.

Answer 1

Using `requests`

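Open Developer Tools > Network > XHR tab in your browser. Then click the Site/Facility Docs tab. You will see an AJAX request appear in the XHR tab. That request is sent to the `profile_report_include` URL used in the code below to fetch the tab's data.

You can simply use the `requests` module to scrape anything from that tab.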

import requests
from bs4 import BeautifulSoup

# URL of the AJAX request seen in the XHR tab
r = requests.get('http://www.envirostor.dtsc.ca.gov/public/profile_report_include?global_id=01290021&ou_id=&site_id=&tabname=sitefacdocs&orderby=&schorderby=&comporderby=&rand=0.07839738919075079&_=1521609095041')
soup = BeautifulSoup(r.text, 'lxml')

# And to check whether we've got the correct data:
table = soup.find('table', class_='display-v4-default')
print(table.find('a', target='_documents').text)
# Soil Management Plan Implementation Report, Public Market Infrastructure Relocation, Phase 1-B Infrastructure Area
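
If you need the same tab for several sites, the long query string can be assembled from its parts instead of being pasted by hand. The following is only a sketch: `build_tab_url` is a hypothetical helper name, and it assumes that the `rand` and `_` parameters in the request above are cache-busting values the server does not require, which I have not verified.

```python
from urllib.parse import urlencode

# Endpoint taken from the AJAX request shown above.
BASE_URL = 'http://www.envirostor.dtsc.ca.gov/public/profile_report_include'


def build_tab_url(global_id, tabname='sitefacdocs'):
    """Return the tab URL for one site (hypothetical helper, not part of requests)."""
    params = {
        'global_id': global_id,
        'ou_id': '',
        'site_id': '',
        'tabname': tabname,
        'orderby': '',
        'schorderby': '',
        'comporderby': '',
        # Assumption: `rand` and `_` are cache busters, so they are left out.
    }
    return BASE_URL + '?' + urlencode(params)


url = build_tab_url('01290021')
```

The resulting `url` can then be passed to `requests.get(url)` exactly as in the code above. If the server turns out to need `rand` and `_`, add them back to the `params` dictionary.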

Using `Selenium`

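If you want to wait for the page to load, never use time.sleep(). You should use Explicit Waits instead. Once the wait has finished, you can call .get_attribute('innerHTML') on the tab's element to get the whole tab content.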

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('http://www.envirostor.dtsc.ca.gov/public/profile_report?global_id=01290021&starttab=landuserestrictions')

# Click the tab; find_element(By.ID, ...) works in both Selenium 3 and 4
driver.find_element(By.ID, 'sitefacdocsTab').click()
# Wait up to 10 seconds for an element that only exists on the new tab
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.ID, 'docdatediv')))

# Grab the HTML of the div that holds the whole tab
html = driver.find_element(By.ID, 'sitefacdocs').get_attribute('innerHTML')
soup = BeautifulSoup(html, 'lxml')
table = soup.find('table', class_='display-v4-default')
print(table.find('a', target='_documents').text)
# Soil Management Plan Implementation Report, Public Market Infrastructure Relocation, Phase 1-B Infrastructure Area

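Additional info:

The id="docdatediv" element is the div tag that contains the date range filter. I used it because it is not present on the first tab but is present on the tab you want. You can use any such element for the WebDriverWait.

And the id="sitefacdocs" element is the div tag that contains the whole tab content (i.e. the date filter and all the tables below it). So your soup object will contain all of that.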
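
To see why an explicit wait beats a fixed time.sleep(), here is a browser-free sketch of the polling idea. It is not Selenium's implementation: `wait_until` and `element_is_present` are hypothetical names, and the counter merely stands in for "the element has appeared in the DOM".

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # Return as soon as the condition holds, without sleeping any longer.
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for "the element has appeared": becomes true on the third check.
checks = {"count": 0}


def element_is_present():
    checks["count"] += 1
    return checks["count"] >= 3


found = wait_until(element_is_present, timeout=2.0, poll_interval=0.01)
```

A fixed sleep always costs the full delay and still fails if the page is slower than expected. A polling wait returns as soon as the condition holds and raises a clear error once the timeout is reached, which is the behaviour you get from WebDriverWait(driver, 10).until(...).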

Selenium web scraping of JavaScript elements
