I'm trying to scrape the population of each sovereign state from the wiki list of sovereign states and add them to an array as each response comes in. In the code below, allList should end up as a list of dicts, with the country's name under ['nation'] and its population under ['demographics']. Thanks in advance.
# -*- coding: utf-8 -*-
import scrapy
import logging
import csv
import pprint


class CrawlerSpider(scrapy.Spider):
    name = 'test2Crawler'
    allowed_domains = ['web']
    start_urls = ['https://en.wikipedia.org/wiki/List_of_sovereign_states']
    urlList = []
    output = []
    fields = ["nation", "demographics"]
    filename = "C:\\second_project\\testWrite.csv"

    def __init__(self):
        self.counter = 1

    def parse(self, response):
        item = {}
        for resultHref in response.xpath(
                '//table[contains(@class, "wikitable")]//a[preceding-sibling::span[@class="flagicon"]]'):
            hrefRaw = resultHref.xpath('./@href').extract_first()
            href = response.urljoin(hrefRaw)
            nameC = resultHref.xpath('./text()').extract_first()
            item['href'] = href
            item['nameC'] = nameC
            self.urlList.append(item.copy())
        self.runSpider()

    def parse_item(self, response):
        i = {}
        print "getting called..", self.counter
        i['nation'] = response.meta['Country']
        i['demographics'] = response.xpath(
            '//tr[preceding-sibling::tr/th/a/text()="Population"]/td/text()').extract_first()
        yield i

    def passLinks(self, givenLink):
        self.counter = self.counter + 1
        if self.counter < 10:
            href = givenLink['href']
            nameC = givenLink['nameC']
            yield scrapy.Request(href, callback=self.parse_item, meta={'Country': nameC})

    def runSpider(self):
        allList = [list(self.passLinks(token)) for token in self.urlList]
        pprint.pprint(allList)
        with open(self.filename, 'wb') as f:
            writer = csv.DictWriter(f, self.fields)
            writer.writeheader()
            for xItem in allList:
                writer.writerow({'nation': xItem['nation'], 'demographics': xItem['demographics']})
Answer (score: 0)
This looks like exactly what pipelines.py in Scrapy is designed for. The problem is that the callback responses are not received in order, or quickly enough, to be collected into a separate array; instead, each item is processed and stored through pipelines.py as its response arrives. Here is the Scrapy documentation on item pipelines, a very useful tool: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
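As a rough sketch of that approach, assuming a standard Scrapy project layout: drop runSpider entirely, yield the scrapy.Request objects directly from parse, and let a pipeline collect the items that parse_item yields. The class name CsvWriterPipeline, the module path myproject.pipelines, and the output filename below are made up for illustration.

# pipelines.py -- minimal CSV pipeline sketch (names and path are assumptions)
import csv

class CsvWriterPipeline(object):
    def open_spider(self, spider):
        # opened once when the crawl starts
        self.f = open('testWrite.csv', 'wb')  # Python 2: csv wants binary mode
        self.writer = csv.DictWriter(self.f, fieldnames=['nation', 'demographics'])
        self.writer.writeheader()

    def process_item(self, item, spider):
        # called for every dict the spider yields, in whatever order responses arrive
        self.writer.writerow(item)
        return item

    def close_spider(self, spider):
        self.f.close()

The pipeline is then switched on in settings.py:

# settings.py (the module path 'myproject' is an assumption)
ITEM_PIPELINES = {
    'myproject.pipelines.CsvWriterPipeline': 300,
}

With that in place, parse can simply yield scrapy.Request(href, callback=self.parse_item, meta={'Country': nameC}) for each country link, and Scrapy takes care of scheduling the requests and feeding every yielded item through process_item.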