Why are some item field values repeated in this Scrapy spider?

Date: 2016-08-23 02:05:06

Tags: python python-3.x scrapy scrapy-spider

When my spider runs on URLs like this one with the following code:

def parse_subandtaxonomy(self, response):
    item = response.meta['item']
    for sub in response.xpath('//div[@class = "page-content"]/section'):
        item['Subcategory'] = sub.xpath('h2/text()').extract()
        for tax in sub.xpath('ul/li/a'):
            item['Taxonomy'] = tax.xpath('text()').extract()
            for href in tax.xpath('@href'):
                # url = response.urljoin(href.extract()) - > this gave me 301 redirects
                badurl = urljoin('https://211sepa.org/search/', href.extract())
                url = badurl.replace('search?', 'search/?area_served=Philadelphia&', 1) # shut off to test multi-page
                request = scrapy.Request(url, callback=self.parse_listings)
                request.meta['item'] = item
                yield item

I get this output, which is what I expect:

{"Category": ["Housing"], "Subcategory": ["Affordable Housing"], "Taxonomy": ["Section 8 Vouchers"]}
{"Category": ["Housing"], "Subcategory": ["Affordable Housing"], "Taxonomy": ["Public Housing"]}
{"Category": ["Housing"], "Subcategory": ["Affordable Housing"], "Taxonomy": ["Low Income/ Subsidized Rental Housing"]}
{"Category": ["Housing"], "Subcategory": ["Shelter"], "Taxonomy": ["Homeless Shelters"]}
{"Category": ["Housing"], "Subcategory": ["Shelter"], "Taxonomy": ["Homeless Shelter Centralized Intake"]}
{"Category": ["Housing"], "Subcategory": ["Shelter"], "Taxonomy": ["Domestic Violence Shelters"]}
{"Category": ["Housing"], "Subcategory": ["Shelter"], "Taxonomy": ["Runaway/ Youth Shelters"]}
{"Category": ["Housing"], "Subcategory": ["Shelter"], "Taxonomy": ["Cold Weather Shelters/ Warming Centers"]}
{"Category": ["Housing"], "Subcategory": ["Shelter"], "Taxonomy": ["Homeless Shelter for Pregnant Women"]}
{"Category": ["Housing"], "Subcategory": ["Stay Housed"], "Taxonomy": ["Rent Payment Assistance"]}
{"Category": ["Housing"], "Subcategory": ["Stay Housed"], "Taxonomy": ["Mortgage Payment Assistance"]}
{"Category": ["Housing"], "Subcategory": ["Stay Housed"], "Taxonomy": ["Landlord/ Tenant Mediation"]}
{"Category": ["Housing"], "Subcategory": ["Stay Housed"], "Taxonomy": ["General Dispute Mediation"]}
{"Category": ["Housing"], "Subcategory": ["Overcome Homelessness"], "Taxonomy": ["Transitional Housing/ Shelter"]}
{"Category": ["Housing"], "Subcategory": ["Overcome Homelessness"], "Taxonomy": ["Rental Deposit Assistance"]}
{"Category": ["Housing"], "Subcategory": ["Overcome Homelessness"], "Taxonomy": ["Permanent Supportive Housing"]}

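But when I change yield item to yield request to continue crawling, every item has {"Category": ["Housing"], "Subcategory": ["Overcome Homelessness"], "Taxonomy": ["Permanent Supportive Housing"] ... other item info ... } instead of its own subcategory and taxonomy. Every item I ultimately want from each taxonomy is being scraped, but it is labeled incorrectly as described above. Any idea what is going on?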

1 answer:

Answer 0: (score: 0)

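This is probably a scope issue. Every request created in the loop carries the same item object in its meta, and Scrapy only runs the parse_listings callbacks later, once the responses arrive; by then the loop has typically finished and the shared item holds the values from the last iteration. You should always try to create items in the narrowest possible scope to prevent stale data from carrying over: for example, if the current iteration has no Taxonomy value, the object keeps the one from the previous iteration. That is why the code should create a new object in each loop iteration wherever possible.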
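The effect can be reproduced without Scrapy. The following is only a minimal sketch: plain dicts stand in for the item and for the requests, and the helper names build_requests_shared and build_requests_copied are hypothetical, not part of the spider or of Scrapy.

```python
def build_requests_shared(base_item, taxonomies):
    """Mimic the buggy loop: every stand-in request carries the same dict."""
    requests = []
    item = base_item  # one object, reused on every iteration
    for taxonomy in taxonomies:
        item["Taxonomy"] = taxonomy  # overwrites the field on the shared dict
        requests.append({"meta": {"item": item}})
    return requests


def build_requests_copied(base_item, taxonomies):
    """Mimic the fix: every stand-in request carries its own copy."""
    requests = []
    for taxonomy in taxonomies:
        item = base_item.copy()  # fresh object per iteration
        item["Taxonomy"] = taxonomy
        requests.append({"meta": {"item": item}})
    return requests


taxonomies = ["Section 8 Vouchers", "Public Housing", "Permanent Supportive Housing"]

shared = build_requests_shared({"Category": "Housing"}, taxonomies)
copied = build_requests_copied({"Category": "Housing"}, taxonomies)

# Read the items only after the loops have finished, as a callback would.
shared_seen = [r["meta"]["item"]["Taxonomy"] for r in shared]
copied_seen = [r["meta"]["item"]["Taxonomy"] for r in copied]

print(shared_seen)  # the last taxonomy, three times
print(copied_seen)  # one taxonomy per request
```

A shallow copy is enough here because each field is reassigned rather than mutated in place; if a field held a list that was later appended to, copy.deepcopy would be the safer choice.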

Try this:

def parse_subandtaxonomy(self, response):
    for sub in response.xpath('//div[@class = "page-content"]/section'):
        # subcategory = sub.xpath('h2/text()').extract()  # original: returns a list
        subcategory = sub.xpath('h2/text()').extract_first()  # takes just the first match (a string, not a list), which is nicer
        for tax in sub.xpath('ul/li/a'):
            item = response.meta['item'].copy()
            item['Subcategory'] = subcategory
            item['Taxonomy'] = tax.xpath('text()').extract()
            for href in tax.xpath('@href'):
                # url = response.urljoin(href.extract()) - > this gave me 301 redirects
                badurl = urljoin('https://211sepa.org/search/', href.extract())
                url = badurl.replace('search?', 'search/?area_served=Philadelphia&', 1) # shut off to test multi-page
                request = scrapy.Request(url, 
                                         callback=self.parse_listings,
                                         meta={'item': item})  # you can put meta here directly
                yield request