硒刮Javascript表

时间:2018-11-10 13:25:25

标签: python selenium

我正努力按照下面的代码进行抓取。如果有人可以看看我所缺少的东西,会不会很感激? 问候 PyProg70

from selenium import webdriver
from selenium.webdriver import FirefoxOptions
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from bs4 import BeautifulSoup
import pandas as pd
import re, time

binary = FirefoxBinary('/usr/bin/firefox')
opts = FirefoxOptions()
opts.add_argument("--headless")

browser = webdriver.Firefox(options=opts, firefox_binary=binary)
browser.implicitly_wait(10)

url = 'http://tenderbulletin.eskom.co.za/'
browser.get(url)

html = browser.page_source
soup = BeautifulSoup(html, 'lxml')

print(soup.prettify())

1 个答案:

答案 0 :(得分:0)

不是Java,而是Javascript。该动态页面,您需要等待并检查Ajax是否完成了请求和使用WebDriverWait呈现的内容。

....
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC

.....
browser.get(url)

# wait max 30 second until table loaded
WebDriverWait(browser, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR , 'table.CSSTableGenerator .ng-binding')))

html = browser.find_element_by_css_selector('table.CSSTableGenerator')
soup = BeautifulSoup(html.get_attribute("outerHTML"), 'lxml')
print(soup.prettify().encode('utf-8'))