Question

I'm building a scraper (using Scrapy) for www.apkmirror.com using the SitemapSpider spider. So far, the following works:

DEBUG = True

from scrapy.spiders import SitemapSpider
from apkmirror_scraper.items import ApkmirrorScraperItem


class ApkmirrorSitemapSpider(SitemapSpider):
    name = 'apkmirror-spider'
    sitemap_urls = ['http://www.apkmirror.com/sitemap_index.xml']
    sitemap_rules = [(r'.*-android-apk-download/$', 'parse')]

    if DEBUG:
        custom_settings = {'CLOSESPIDER_PAGECOUNT': 20}

    def parse(self, response):
        item = ApkmirrorScraperItem()
        item['url'] = response.url
        item['title'] = response.xpath('//h1[@title]/text()').extract_first()
        item['developer'] = response.xpath('//h3[@title]/a/text()').extract_first()
        return item

where ApkmirrorScraperItem is defined in items.py as follows:

import scrapy


class ApkmirrorScraperItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    developer = scrapy.Field()

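If I run it from the project directory using the command

scrapy crawl apkmirror-spider -o data.json

then the resulting JSON output is an array of JSON dictionaries with the keys url, title, and developer, and the corresponding strings as values. I would like to modify this, however, so that the value of developer is itself a dictionary with a name field, so that I can populate it like this: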

item['developer']['name'] = response.xpath('//h3[@title]/a/text()').extract_first()

However, if I try this I get a KeyError, and the same happens if I initialize developer's Field as a dict. How can I go about this?

Answer 1

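Scrapy implements fields internally as dicts, but that does not mean they should be accessed as dicts. When you call item['developer'], what you are really doing is getting the value of the field, not the field itself. So if the value has not been set yet, this raises a KeyError.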
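To make the distinction between a field declaration and a field value concrete, here is a minimal sketch that uses only the standard library. The names Field, MiniItem, and DemoItem are hypothetical stand-ins written for this illustration; this is a simplified model of the behaviour described above, not Scrapy's actual implementation.

```python
class Field(dict):
    """Stand-in for scrapy.Field: a dict that only holds field metadata."""


class MiniItem:
    """Simplified, hypothetical model of an item (not Scrapy's real code)."""

    fields = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Collect class attributes that are Field instances as declared fields.
        cls.fields = {k: v for k, v in vars(cls).items() if isinstance(v, Field)}

    def __init__(self):
        # Values are stored separately from the field declarations.
        self._values = {}

    def __getitem__(self, key):
        # Raises KeyError until a value has been assigned for this key.
        return self._values[key]

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        self._values[key] = value


class DemoItem(MiniItem):
    developer = Field()


item = DemoItem()
try:
    # The lookup item['developer'] runs first and fails: no value is set yet.
    item['developer']['name'] = 'Example Developer'
    raised = False
except KeyError:
    raised = True

item['developer'] = {}                           # assign a dict as the value
item['developer']['name'] = 'Example Developer'  # nested assignment now works
```

In this model the declared Field lives in DemoItem.fields, while item['developer'] only ever looks at the stored values, which is why the nested assignment fails until a dict has been assigned.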

With this in mind, there are two ways to solve your problem.

First, just set the developer field value to a dict:

def parse(self, response):
    item = ApkmirrorScraperItem()
    item['url'] = response.url
    item['title'] = response.xpath('//h1[@title]/text()').extract_first()
    item['developer'] = {'name': response.xpath('//h3[@title]/a/text()').extract_first()}
    return item

Second, create a new Developer class and set the developer value to an instance of that class:

# this can go to items.py
class Developer(scrapy.Item):
    name = scrapy.Field()

def parse(self, response):
    item = ApkmirrorScraperItem()
    item['url'] = response.url
    item['title'] = response.xpath('//h1[@title]/text()').extract_first()

    dev = Developer()
    dev['name'] = response.xpath('//h3[@title]/a/text()').extract_first()
    item['developer'] = dev

    return item
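For reference, this is roughly the shape of output the question is after: each entry in data.json carries a nested developer object. The sketch below uses only the standard library json module, and the record values are made-up placeholders, not real scraped data.

```python
import json

# Placeholder record standing in for one scraped item (values are made up).
record = {
    "url": "http://www.apkmirror.com/example-android-apk-download/",
    "title": "Example App",
    "developer": {"name": "Example Developer"},
}

# The feed export writes an array of such objects.
output = json.dumps([record], indent=2)
print(output)
```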

Hope that helps :)

How to populate a scrapy.Field as a dictionary
