Scrapy - calling a spider's method from another script

Time: 2016-09-24 22:57:52

Tags: scrapy scrapy-spider

I created this class with a parse() method:

import scrapy

from blogs.items import PitchforkItem  # assumed location, following the default Scrapy project layout


class PitchforkSpider(scrapy.Spider):
    name = "pitchfork_reissues"
    allowed_domains = ["pitchfork.com"]
    # Scrapy sends one request per URL listed here and passes each response to parse()
    start_urls = [
        "http://pitchfork.com/reviews/best/reissues/?page=1",
        "http://pitchfork.com/reviews/best/reissues/?page=2",
        "http://pitchfork.com/reviews/best/reissues/?page=3",
    ]

    def parse(self, response):
        items = []

        for sel in response.xpath('//div[@class="album-artist"]'):
            item = PitchforkItem()
            # The leading "." keeps each query relative to sel; a bare "//" would search the whole page
            item['artist'] = sel.xpath('.//ul[@class="artist-list"]/li/text()').extract()
            item['reissue'] = sel.xpath('.//h2[@class="title"]/text()').extract()
            items.append(item)

        return items

From another script, I import the class from the module above:

from blogs.spiders.pitchfork_reissues_feed import *

Then, after instantiating the class, I try to call its parse() method:

def reissues():

    pitchfork_reissues = PitchforkSpider()
    albums = pitchfork_reissues.parse(response)
    print (albums)

But I get the following error:

    reissues = pitchfork_reissues.parse(response)
NameError: global name 'response' is not defined

Evidently, parse() needs an instance of scrapy.http.Response. How can I create such an instance in the context of the second script, inside reissues()?

1 answer:

Answer 0 (score: 0)

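from scrapy.http import HtmlResponse

# A response requires a url; HtmlResponse (unlike the base Response) accepts text and supports response.xpath()
response = HtmlResponse(url='http://pitchfork.com/reviews/best/reissues/?page=1', body=u'html here', encoding='utf-8')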

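That said, I don't think you can really scrape this way, because it is not how Scrapy is meant to work: Scrapy normally downloads the pages and calls parse() itself. You can still create a Response object like this, though.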