I am scraping a website that generates a lot of JavaScript when the page is called, so traditional web-scraping approaches (BeautifulSoup, etc.) won't work for my purposes (at least, I never got them to work; all of the important data is in the JavaScript-generated parts). As a result I started using Selenium WebDriver. I need to scrape a few hundred pages, each with 10 to 80 data points (each with about 12 fields), so it is important that this script (is that the right term?) can run for a long time without me babysitting it.
My code works for a single page, and I have a controlling section that tells the scraping section which page to scrape. The problem is that sometimes the JavaScript sections of the page load and sometimes they don't (roughly 1 in 7). A refresh fixes things, but occasionally the refresh freezes the WebDriver, and with it the Python runtime environment as well. Maddeningly, when it freezes like this the code cannot time out. What is going on?
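To illustrate what I mean by a hard timeout: the only workaround I can picture is force-quitting the browser from a second thread when refresh() never returns, something like the sketch below (not part of my real script, and untested):

import threading

def refresh_with_watchdog(driver, hard_timeout=60):
    # If refresh() blocks past hard_timeout, the timer thread force-quits
    # the browser, so the main thread unblocks with an exception instead
    # of hanging forever.
    watchdog = threading.Timer(hard_timeout, driver.quit)
    watchdog.start()
    try:
        driver.refresh()
    finally:
        watchdog.cancel()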
Here is a stripped-down version of my code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException, TimeoutException
import time, re, random, csv
from collections import namedtuple
def main(url_full):
    driver = webdriver.Firefox()
    driver.implicitly_wait(15)
    driver.set_page_load_timeout(30)

    # create HealthPlan namedtuple
    HealthPlan = namedtuple("HealthPlan",
        ("State, County, FamType, Provider, PlanType, Tier,") +
        (" Premium, Deductible, OoPM, PrimaryCareVisitCoPay, ER, HospitalStay,") +
        (" GenericRx, PreferredPrescription, RxOoPM, MedicalDeduct, BrandDrugDeduct"))

    # check whether the page has loaded and handle page load and timeout errors
    pageNotLoaded = True
    while pageNotLoaded:
        try:
            driver.get(url_full)
            time.sleep(6 + abs(random.normalvariate(1.8, 3)))
        except TimeoutException:
            driver.quit()
            time.sleep(3 + abs(random.normalvariate(1.8, 3)))
            driver.get(url_full)
            time.sleep(6 + abs(random.normalvariate(1.8, 3)))

        # Handle page load error by testing presence of showAll,
        # an important feature of the page, which only appears if everything else loads
        try:
            driver.find_element_by_xpath('//*[@id="showAll"]').text
        # catch NoSuchElementException => refresh page
        except NoSuchElementException:
            try:
                driver.refresh()
            # catch TimeoutException => quit and load the page
            # in a new instance of firefox.
            # I don't think the code ever gets here, because it freezes in the refresh
            # and will not throw the timeout exception like I would like
            except TimeoutException:
                driver.quit()
                time.sleep(3 + abs(random.normalvariate(1.8, 3)))
                driver.get(url_full)
                time.sleep(6 + abs(random.normalvariate(1.8, 3)))

        pageNotLoaded = False

    scrapePage()  # this is a dummy function, everything from here down works fine
I have done extensive research on similar questions, and I don't think anyone has posted about this here or anywhere else I have looked. I am using Python 2.7 and Selenium 2.39.0, and I am trying to scrape Healthcare.gov's premium-estimate pages.
EDIT: (as an example, this page) It's worth mentioning that the pages fail to load more often when the computer has been on/running for a while (I'm guessing that free RAM is filling up and it glitches while loading), but that is somewhat beside the point, because this should be handled by the try/except.
EDIT2: I should also mention that this is running on Windows 7 64-bit, with Firefox 17 (which I believe is the latest supported version).
Answer 0 (score: 2)
Dude, time.sleep is a fail!
What is this?
time.sleep(3 + abs(random.normalvariate(1.8, 3)))
Try this:
import unittest

class TestPy(unittest.TestCase):
    def waits(self):
        self.implicit_wait = 30
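(In context, that would look something like this; the setUp/tearDown wiring and the 30-second value are my assumptions:)

import unittest
from selenium import webdriver

class TestPy(unittest.TestCase):
    def setUp(self):
        # assumed setup: each test gets its own driver with an implicit wait
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(30)

    def tearDown(self):
        self.driver.quit()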
Or this:
(self.)driver.implicitly_wait(10)
Or this:
from selenium.webdriver.support.ui import WebDriverWait

WebDriverWait(driver, 10).until(lambda driver: driver.find_element_by_xpath('some_xpath'))
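For example, to wait for the showAll element from your question (a sketch using expected_conditions, which ships with Selenium's support package):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://example.com")  # placeholder URL
# poll up to 30s for the element that only appears once the JS has rendered
element = WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.ID, "showAll")))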
Or, instead of driver.refresh(), you can trick it with:
driver.get(your_url)  # your URL here
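Combined with set_page_load_timeout, a retry loop along these lines (a sketch; the names are illustrative) avoids refresh() entirely by throwing away the wedged browser and starting over:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException, WebDriverException

def load_with_retries(url, attempts=3):
    for _ in range(attempts):
        driver = webdriver.Firefox()
        driver.set_page_load_timeout(30)
        try:
            driver.get(url)
            return driver  # success: hand back a live driver
        except (TimeoutException, WebDriverException):
            driver.quit()  # discard the wedged browser entirely
    raise RuntimeError("page never loaded: %s" % url)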
You can also clear the cookies:
driver.delete_all_cookies()
As for scrapePage() ("this is a dummy function, everything from here down works fine"), have a look at:
http://scrapy.org
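If you go that route, a minimal spider looks something like this (a sketch with hypothetical selectors; note that Scrapy by itself does not execute JavaScript, so the data would have to be present in the HTML):

import scrapy

class PlansSpider(scrapy.Spider):
    name = "plans"
    start_urls = ["http://example.com/plans"]  # hypothetical URL

    def parse(self, response):
        # hypothetical selectors: emit one item per table row
        for row in response.css("table tr"):
            yield {"fields": row.css("td::text").extract()}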