Selenium not working inside Scrapy

Asked: 2016-06-16 08:50:23

Tags: python python-2.7 selenium selenium-webdriver scrapy

I have a Scrapy CrawlSpider that parses links and returns HTML content just fine. For JavaScript pages, I invoke Selenium to access the 'hidden' content. The problem is that although Selenium works outside of Scrapy's parsing, it does not work inside the parse_item function.

from scrapy.spiders import CrawlSpider, Rule, Spider
from scrapy.selector import HtmlXPathSelector
from scrapy.linkextractors import LinkExtractor
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from craigslist_sample.items import CraigslistReviewItem
import scrapy
from selenium import selenium
from selenium import webdriver


class MySpider(CrawlSpider):
    name = "spidername"
    allowed_domains = ["XXXXX"]
    start_urls = ['XXXXX']

    rules = (
        Rule(LinkExtractor(allow=(r'reviews\?page',)), callback='parse_item'),
        Rule(LinkExtractor(allow=(r'.',), deny=(r'reviews\?page',)), follow=True),
    )

    def __init__(self):
        #this page loads 
        CrawlSpider.__init__(self)
        self.selenium = webdriver.Firefox()
        self.selenium.get('XXXXX')
        self.selenium.implicitly_wait(30)


    def parse_item(self, response):
        #this page doesnt
        print response.url
        self.driver.get(response.url)
        self.driver.implicitly_wait(30)

        #...do things

2 Answers:

Answer 0 (score: 1)

You have a variable naming problem. In the __init__ method you assign the browser instance to self.selenium, but then in the parse_item method you use self.driver as the browser instance. I've updated your script; try it now.

from scrapy.spiders import CrawlSpider, Rule, Spider
from scrapy.selector import HtmlXPathSelector
from scrapy.linkextractors import LinkExtractor
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from craigslist_sample.items import CraigslistReviewItem
import scrapy
from selenium import selenium
from selenium import webdriver


class MySpider(CrawlSpider):
    name = "spidername"
    allowed_domains = ["XXXXX"]
    start_urls = ['XXXXX']

    rules = (
        Rule(LinkExtractor(allow=(r'reviews\?page',)), callback='parse_item'),
        Rule(LinkExtractor(allow=(r'.',), deny=(r'reviews\?page',)), follow=True),
    )

    def __init__(self):
        #this page loads 
        CrawlSpider.__init__(self)
        self.driver = webdriver.Firefox()
        self.driver.get('XXXXX')
        self.driver.implicitly_wait(30)


    def parse_item(self, response):
        #this page doesnt
        print response.url
        self.driver.get(response.url)
        self.driver.implicitly_wait(30)

        #...do things
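One follow-up worth noting when embedding Selenium in a spider this way: the Firefox instance opened in __init__ is never closed, so a browser window lingers after every crawl. Scrapy calls a spider's closed(reason) method when the crawl finishes, which is a convenient place to quit the driver. A minimal sketch of that lifecycle (DummyDriver is a made-up stand-in for webdriver.Firefox, just so the idea runs without a real browser):

```python
class SeleniumSpiderSketch(object):
    # Sketch: hold a webdriver instance and quit it when Scrapy
    # closes the spider, mirroring the spider above.
    def __init__(self, driver):
        self.driver = driver

    def closed(self, reason):
        # Scrapy invokes closed(reason) at the end of the crawl.
        self.driver.quit()


class DummyDriver(object):
    # Hypothetical stand-in for webdriver.Firefox(), only to show
    # the open/quit lifecycle.
    def __init__(self):
        self.alive = True

    def quit(self):
        self.alive = False


spider = SeleniumSpiderSketch(DummyDriver())
spider.closed('finished')
print(spider.driver.alive)  # → False
```

In the real spider you would simply add a `closed` method that calls `self.driver.quit()`.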

Answer 1 (score: 0)

Great! Combining Hasan's answer with a better understanding of the URLs I was scraping led to the solution (it turned out the site planted 'fake' URLs that never loaded).
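One way to keep such dead 'fake' URLs out of the crawl is to tighten the allow/deny patterns in the rules. How the two regexes used above split URLs can be sketched with the standard re module (the URLs below are made up for illustration):

```python
import re

# Same pattern as in the spider's rules, as a raw string.
review_pages = re.compile(r'reviews\?page')

urls = [
    'http://example.com/biz/reviews?page=2',  # hypothetical review page
    'http://example.com/biz/photos',          # hypothetical non-review page
]

# First rule: only review pagination links go to parse_item.
to_parse = [u for u in urls if review_pages.search(u)]
# Second rule: everything else is followed but not parsed.
to_follow = [u for u in urls if not review_pages.search(u)]

print(to_parse)   # → ['http://example.com/biz/reviews?page=2']
print(to_follow)  # → ['http://example.com/biz/photos']
```

Adding the bogus URL shapes to the deny tuple of the second rule would stop the spider from following them at all.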