Question

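I'm building a simple(ish) parser in Scrapy, and I'm blissfully ignorant when it comes to both Scrapy and Python :-) In item.py I have a definition of thisItem(), which I assign to item in the code below. Everything was going swimmingly, with parse using a callback to get to parse_dir_contents... and then I realized I needed to scrape an extra bit of data, so I created another function, parse_other_content. How do I get what is already in item into parse_other_content?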

import scrapy
from this_site.items import thisItem  # placeholder package name; hyphens are not valid in Python module names
import re
import json

class DmozSpider(scrapy.Spider):
    name = "ABB"
    allowed_domains = ["this-site.com.au"]
    start_urls = [
        "https://www.this-site.com.au?page=1",
        "https://www.this-site.com.au?page=2",
    ]

    def parse(self, response):
        for href in response.xpath('//h3/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath('//h1[@itemprop="name"]'):
            item = thisItem()
            item['title'] = sel.xpath('text()').extract()
            item['rate'] = response.xpath('//div[@class="rate"]/div/span/text()').extract()
            so = re.search( r'\d+', response.url)
            propID = so.group()
            item['propid'] = propID
            item['link'] = response.url
            yield scrapy.Request("https://www.this-site.com.au/something?listing_id="+propID,callback=self.parse_other_content)
            #yield item

    def parse_other_content(self, response):
        sel = json.loads(response.body)
        # NameError here: item was created in parse_dir_contents, not in this method
        item['rate_detail'] = sel["this"][0]["that"]
        yield item

I know I'm missing something simple here, but I can't seem to figure it out.

Answer 1

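According to the Scrapy documentation (http://doc.scrapy.org/en/1.0/topics/request-response.html#topics-request-response-ref-request-callback-arguments):

In some cases you may be interested in passing arguments to those callback functions so you can receive the arguments later, in the second callback. You can use the Request.meta attribute for that.

In your case I would do something like this: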

def parse_dir_contents(self, response):
    for sel in response.xpath('//h1[@itemprop="name"]'):
        item = thisItem()
        ...
        request = scrapy.Request("https://www.this-site.com.au/something?listing_id="+propID,callback=self.parse_other_content)
        request.meta['item'] = item
        yield request

def parse_other_content(self, response):
    item = response.meta['item']
    # do something with the item
    return item

Per Steve (see comments), you can also pass the meta dictionary as a keyword argument to the Request constructor, like so:

def parse_dir_contents(self, response):
    for sel in response.xpath('//h1[@itemprop="name"]'):
        item = thisItem()
        ...
        request = scrapy.Request("https://www.this-site.com.au/something?listing_id="+propID,callback=self.parse_other_content, meta={'item':item})
        yield request
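
If it helps to see the handoff in isolation, here is a minimal sketch that runs without Scrapy. FakeRequest, FakeResponse, and run are hypothetical stand-ins written for this illustration, not Scrapy classes, and the item is a plain dict rather than thisItem. It only mimics the idea that whatever is stored in the request's meta is available again on the response passed to the callback.

```python
# Stand-ins for illustration only; these are NOT Scrapy's classes.
class FakeRequest:
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = dict(meta or {})


class FakeResponse:
    def __init__(self, request, body):
        self.url = request.url
        self.body = body
        self.meta = request.meta  # the response exposes the request's meta


def parse_dir_contents(prop_id):
    # First callback: fill in part of the item, then attach it to the request.
    item = {"propid": prop_id}
    return FakeRequest(
        "https://www.this-site.com.au/something?listing_id=" + prop_id,
        callback=parse_other_content,
        meta={"item": item},
    )


def parse_other_content(response):
    # Second callback: pick the same item back up and finish it.
    item = response.meta["item"]
    item["rate_detail"] = response.body
    return item


def run(requests, bodies):
    """Toy engine: deliver each response to its own request's callback."""
    return [req.callback(FakeResponse(req, body))
            for req, body in zip(requests, bodies)]


requests = [parse_dir_contents("101"), parse_dir_contents("202")]
items = run(requests, ["detail for 101", "detail for 202"])
print(items)
```

Each item stays paired with its own request, so the details cannot get mixed up between listings. Scrapy 1.7 and later also accept a cb_kwargs argument on Request, which passes values to the callback as keyword arguments; check the documentation for the version you are running.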

Answer 2

You can make item visible to parse_other_content() either by changing it to self.item or by sending it as a parameter to the function. (The first one might be easier.)

For the first solution, just add self. to every reference to the item variable. This makes it visible to the entire class.

def parse_dir_contents(self, response):
    for sel in response.xpath('//h1[@itemprop="name"]'):
        self.item = thisItem()
        self.item['title'] = sel.xpath('text()').extract()
        self.item['rate'] = response.xpath('//div[@class="rate"]/div/span/text()').extract()
        so = re.search( r'\d+', response.url)
        propID = so.group()
        self.item['propid'] = propID
        self.item['link'] = response.url
        yield scrapy.Request("https://www.this-site.com.au/something?listing_id="+propID,callback=self.parse_other_content)
        #yield item

def parse_other_content(self, response):
    sel = json.loads(response.body)
    self.item['rate_detail'] = sel["this"][0]["that"]
    yield self.item
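
One caveat worth checking before relying on self.item: Scrapy handles requests asynchronously, so several follow-up requests can be pending at once, and they all share the one attribute on the spider. The sketch below is a toy simulation without Scrapy; SharedStateSpider and its plain-dict items are hypothetical, and the "pending" list only imitates the case where both listing pages are parsed before either follow-up response arrives.

```python
class SharedStateSpider:
    """Toy spider that keeps the current item on self."""

    def parse_dir_contents(self, prop_id):
        # Overwrites whatever item the previous listing stored.
        self.item = {"propid": prop_id}
        return prop_id  # stands in for the follow-up request

    def parse_other_content(self, detail):
        self.item["rate_detail"] = detail
        return dict(self.item)  # copy, so each result can be inspected later


spider = SharedStateSpider()

# Both listing pages are parsed before either follow-up response arrives.
pending = [spider.parse_dir_contents("101"), spider.parse_dir_contents("202")]

# The follow-up callbacks then run, but only the last item is still on self.
results = [spider.parse_other_content("detail for " + p) for p in pending]
print(results)
```

In this simulation both results carry propid "202", and the data scraped for listing "101" is lost. Attaching the item to the request itself avoids that, because each callback receives its own item.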

