Python - how to add the responses yielded from scrapy.Request to an array

Asked: 2017-04-25 12:27:12

Tags: python scrapy

I am trying to collect the populations of the various sovereign states from the wiki list of sovereign states, adding them to an array as each response comes in. In the code below, allList should end up holding a list of dictionaries, with each country's name under ['nation'] and its population under ['demographics']. Many thanks.

# -*- coding: utf-8 -*-
import scrapy
import logging
import csv
import pprint

class CrawlerSpider(scrapy.Spider):
    name = 'test2Crawler'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/List_of_sovereign_states']
    urlList = []
    output = []
    fields = ["nation", "demographics"]
    filename = "C:\\second_project\\testWrite.csv"

    def __init__(self):
        self.counter = 1

    def parse(self, response):
        item = {}
        # collect the link and display name of each flagged country in the table
        for resultHref in response.xpath(
                '//table[contains(@class, "wikitable")]//a[preceding-sibling::span[@class="flagicon"]]'):
            hrefRaw = resultHref.xpath('./@href').extract_first()
            href = response.urljoin(hrefRaw)
            nameC = resultHref.xpath('./text()').extract_first()
            item['href'] = href
            item['nameC'] = nameC
            self.urlList.append(item.copy())

        # hand the collected links off to be crawled and written out
        self.runSpider()

    def parse_item(self, response):
        i = {}
        print "getting called..", self.counter
        i['nation'] = response.meta['Country']
        i['demographics'] = response.xpath(
            '//tr[preceding-sibling::tr/th/a/text()="Population"]/td/text()').extract_first()
        yield i

    def passLinks(self, givenLink):
        # only follow the first handful of links while testing
        self.counter += 1
        if self.counter < 10:
            href = givenLink['href']
            nameC = givenLink['nameC']
            yield scrapy.Request(href, callback=self.parse_item, meta={'Country': nameC})

    def runSpider(self):
        # intended: allList becomes a list of {'nation': ..., 'demographics': ...} dicts
        allList = [list(self.passLinks(token)) for token in self.urlList]
        pprint.pprint(allList)
        with open(self.filename, 'wb') as f:
            writer = csv.DictWriter(f, self.fields)
            writer.writeheader()
            for xItem in allList:
                writer.writerow({'nation': xItem['nation'], 'demographics': xItem['demographics']})

1 Answer:

Answer 0 (score: 0)

It seems this is exactly what pipelines.py in Scrapy was designed for. The problem is that the callback responses are not received in order, or quickly enough, to be stored in a separate array: Scrapy handles requests asynchronously, so their results cannot be gathered into a local list the way the code above attempts. Once each response is received, its item is instead processed and stored through pipelines.py. Here is the Scrapy documentation on using pipelines, a very useful tool: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
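
As a concrete illustration, here is a minimal pipeline sketch. The class name and the hard-coded output path are placeholders rather than anything from the answer, and it assumes the spider yields items with 'nation' and 'demographics' keys, as parse_item above does:

# pipelines.py -- a sketch, assuming the spider yields items with
# 'nation' and 'demographics' keys as in parse_item above
import csv

class PopulationCsvPipeline(object):

    def open_spider(self, spider):
        # the "array": items are appended here as each response arrives
        self.items = []

    def process_item(self, item, spider):
        self.items.append(dict(item))
        return item  # hand the item on to any later pipelines

    def close_spider(self, spider):
        # the crawl is finished, so the complete list is available here
        # (on Python 3, open with 'w' and newline='' instead of 'wb')
        with open('testWrite.csv', 'wb') as f:
            writer = csv.DictWriter(f, fieldnames=['nation', 'demographics'])
            writer.writeheader()
            writer.writerows(self.items)

The pipeline is enabled through the ITEM_PIPELINES setting in settings.py, for example ITEM_PIPELINES = {'myproject.pipelines.PopulationCsvPipeline': 300}, where 'myproject' stands for the project's package name. Note that items only reach the pipeline if the requests are actually scheduled, so parse would also need to yield the scrapy.Request objects back to the engine rather than consuming them itself in runSpider.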