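The page has several clickable elements, and I am trying to scrape the pages they lead to. However, I get the following error, and the spider closes after the first click: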
StaleElementReferenceException: Message: Element not found in the cache - perhaps the page has changed since it was looked up
For now I just want to open each page and capture the new URL. Here is my code:
from scrapy import signals
from scrapy.http import TextResponse
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.xlib.pydispatch import dispatcher

from MySpider.items import MyItem

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

import time

class MySpider(Spider):
    name = "myspider"
    allowed_domains = ["http://example.com"]
    base_url = 'http://example.com'
    start_urls = ["http://example.com/Page.aspx",]

    def __init__(self):
        self.driver = webdriver.Firefox()
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        self.driver.close()

    def parse(self, response):
        self.driver.get(response.url)
        item = MyItem()

        links = self.driver.find_elements_by_xpath("//input[@class='GetData']")

        for button in links:
            button.click()
            time.sleep(5)

            source = self.driver.page_source
            sel = Selector(text=source) # create a Selector object

            item['url'] = self.driver.current_url

            print '\n\nURL\n', item['url'], '\n'
            yield item
Answer 0 (score: 2)
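This happens because the link elements belong to the first page. Once the browser navigates to a new page, the references you hold to those elements become stale, so the next click() raises StaleElementReferenceException.
You can try either of the following two solutions: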
1. Store the URLs of the link elements up front, then open each one with driver.get(url).
def parse(self, response):
    self.driver.get(response.url)

    # Read the URLs before navigating; plain strings stay valid after the page changes
    links = self.driver.find_elements_by_xpath("//input[@class='GetData']")
    link_urls = [link.get_attribute("href") for link in links]

    for link_url in link_urls:
        self.driver.get(link_url)
        time.sleep(5)

        source = self.driver.page_source
        sel = Selector(text=source) # create a Selector object

        item = MyItem()  # create a new item for each page
        item['url'] = self.driver.current_url

        print '\n\nURL\n', item['url'], '\n'
        yield item
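Before looping over the stored URLs in the first solution, it can help to clean them up. The following is only a sketch: absolute_urls is a hypothetical helper, not part of Scrapy or Selenium, and it assumes the elements may lack an href (get_attribute returns None in that case) or may repeat the same target.

```python
from urllib.parse import urljoin


def absolute_urls(hrefs, base_url):
    """Resolve hrefs against base_url, skipping empty values and duplicates.

    Hypothetical helper for illustration; order of first appearance is kept.
    """
    seen = set()
    result = []
    for href in hrefs:
        if not href:  # None or "" when the element has no usable href
            continue
        url = urljoin(base_url, href)  # absolute hrefs pass through unchanged
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result
```

In the parse method above, this could be applied as link_urls = absolute_urls(link_urls, response.url) before the loop. If the list comes back empty, the elements probably do not carry an href at all, and the second solution is the better fit.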
2. After clicking a link and capturing the URL, call driver.back() to return to the first page, then find the link elements again.
def parse(self, response):
    self.driver.get(response.url)
    links = self.driver.find_elements_by_xpath("//input[@class='GetData']")

    for i in range(len(links)):
        links[i].click()
        time.sleep(5)

        source = self.driver.page_source
        sel = Selector(text=source) # create a Selector object

        item = MyItem()  # create a new item for each page
        item['url'] = self.driver.current_url

        print '\n\nURL\n', item['url'], '\n'
        yield item

        # Return to the first page and look the elements up again,
        # because the old references are stale after navigation
        self.driver.back()
        links = self.driver.find_elements_by_xpath("//input[@class='GetData']")
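The click-and-go-back pattern from the second solution can also be written so that the elements are looked up at the top of every pass, which means no reference is ever reused after a navigation. This is a minimal sketch rather than a drop-in replacement: visit_each and locate are hypothetical names, and the driver only needs the current_url attribute and back() method that Selenium drivers provide.

```python
def visit_each(driver, locate):
    """Click each element returned by locate(driver), one per page visit.

    locate is called again on every pass, so each click uses a fresh
    element reference from the page that is currently loaded.
    """
    urls = []
    index = 0
    while True:
        elements = locate(driver)  # fresh lookup on the current page
        if index >= len(elements):
            break  # also stops cleanly if the page now has fewer elements
        elements[index].click()
        urls.append(driver.current_url)
        driver.back()  # return to the listing page before the next lookup
        index += 1
    return urls
```

With the spider above, locate could be lambda d: d.find_elements_by_xpath("//input[@class='GetData']"). The sketch leaves out waiting for the page to load; in practice a pause or an explicit wait after click() and back() is likely still needed, as with time.sleep(5) in the original code.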