我有一个scrapy Crawlspider解析链接并返回html内容就好了。对于javascript页面,我邀请Selenium访问“隐藏的”#39;内容。问题是虽然Selenium在scrapy解析之外工作,但它在parse_items函数内部不起作用
from scrapy.spiders import CrawlSpider, Rule, Spider
from scrapy.selector import HtmlXPathSelector
from scrapy.linkextractors import LinkExtractor
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from craigslist_sample.items import CraigslistReviewItem
import scrapy
from selenium import selenium
from selenium import webdriver
class MySpider(CrawlSpider):
name = "spidername"
allowed_domains = ["XXXXX"]
start_urls = ['XXXXX']
rules = (
Rule(LinkExtractor(allow = ('reviews\?page')),callback= 'parse_item'),
Rule(LinkExtractor(allow=('.',),deny = ('reviews\?page',)),follow=True))
def __init__(self):
#this page loads
CrawlSpider.__init__(self)
self.selenium = webdriver.Firefox()
self.selenium.get('XXXXX')
self.selenium.implicitly_wait(30)
def parse_item(self, response):
#this page doesnt
print response.url
self.driver.get(response.url)
self.driver.implicitly_wait(30)
#...do things
答案 0 :(得分:1)
你有一些变数问题。在init方法中,您将浏览器实例分配给self.selenium,然后在方法 parse_item 中将self.driver用作浏览器实例。我已经更新了你的脚本。现在试试。
from scrapy.spiders import CrawlSpider, Rule, Spider
from scrapy.selector import HtmlXPathSelector
from scrapy.linkextractors import LinkExtractor
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from craigslist_sample.items import CraigslistReviewItem
import scrapy
from selenium import selenium
from selenium import webdriver
class MySpider(CrawlSpider):
name = "spidername"
allowed_domains = ["XXXXX"]
start_urls = ['XXXXX']
rules = (
Rule(LinkExtractor(allow = ('reviews\?page')),callback= 'parse_item'),
Rule(LinkExtractor(allow=('.',),deny = ('reviews\?page',)),follow=True))
def __init__(self):
#this page loads
CrawlSpider.__init__(self)
self.driver= webdriver.Firefox()
self.driver.get('XXXXX')
self.driver.implicitly_wait(30)
def parse_item(self, response):
#this page doesnt
print response.url
self.driver.get(response.url)
self.driver.implicitly_wait(30)
#...do things
答案 1 :(得分:0)
大!结合哈桑的答案和更好的知识我正在抓的网址导致答案(原来网站上种了'假的'网址从未加载过)