I wrote a simple web scraper that takes addresses and zip codes from a CSV file and extracts the street name and that street's serial number. I want to save the street name, serial number, and zip code in a new CSV file, but I don't know how to pass the zip code to my parse() method, since I launch the spider from the command line:
scrapy crawl Geospider -o Scraped_data.csv -t csv
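For reference, AddressesAndZipcodes.csv is just a two-column file with an address in the first column and the matching zip code in the second, something like this (the values here are made up):

Vestergade,8000
Nørregade,1165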
Here is my spider (the code won't actually work as posted, because the page I am scraping requires a login and password, and I won't share mine; anyone can register as a user at http://download.kortforsyningen.dk//content/opret-mig-som-bruger, but that is not the part my question is about):
from scrapy.spider import BaseSpider
from scrapy.selector import XmlXPathSelector
from scrapy.item import Item, Field
import csv


class Road(Item):
    RoadNum = Field()
    RoadName = Field()
    PostNum = Field()


class Geospider(BaseSpider):
    name = 'Geospider'
    allowed_domains = ["http://kortforsyningen.kms.dk/"]

    def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
        csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
        for row in csv_reader:
            yield [unicode(cell, 'utf-8') for cell in row]

    filename = 'AddressesAndZipcodes.csv'
    reader = unicode_csv_reader(open(filename))
    start_urls = []
    ZipCode = []
    for row in reader:
        Address = row[0]
        Zip = row[1]
        start_urls.append('http://kortforsyningen.kms.dk/service?ServiceName=geoV&soegemetode=0&vejnavn=%s&kommunepost=%s&format=XML&max_hits=10&login=xxx&password=xxx' % (Address, ZipCode))
        ZipCode.append(Zip)

    def parse(self, response):
        xxs = XmlXPathSelector(response)
        sites = xxs.select('//dokument/forekomst')
        items = Road()
        items['RoadNum'] = sites.select("vejkode/text()").extract()
        items['RoadName'] = sites.select("vejnavn/text()").extract()
        items['PostNum'] = ZipCode
        yield items, ZipCode
Any ideas on how to pass ZipCode to parse() so that the zip codes get saved together with the other results?

Thanks
Answer 0 (score: 2)
Override start_requests(), read the csv file there, and pass the zip code along in request.meta. This should work for you:
from scrapy.http import Request
from scrapy.spider import BaseSpider
from scrapy.selector import XmlXPathSelector
from scrapy.item import Item, Field
import csv


class Road(Item):
    RoadNum = Field()
    RoadName = Field()
    PostNum = Field()


def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
    csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
    for row in csv_reader:
        yield [unicode(cell, 'utf-8') for cell in row]


class Geospider(BaseSpider):
    name = 'Geospider'
    allowed_domains = ["http://kortforsyningen.kms.dk/"]
    start_urls = []

    def start_requests(self):
        reader = unicode_csv_reader(open('AddressesAndZipcodes.csv'))
        for row in reader:
            address, zip_code = row[:2]
            url = 'http://kortforsyningen.kms.dk/service?ServiceName=geoV&soegemetode=0&vejnavn=%s&kommunepost=%s&format=XML&max_hits=10&login=xxx&password=xxx' % (address, zip_code)
            # attach the zip code to the request so it travels along with it
            yield Request(url=url, meta={'zip_code': zip_code})

    def parse(self, response):
        xxs = XmlXPathSelector(response)
        sites = xxs.select('//dokument/forekomst')
        item = Road()
        item['RoadNum'] = sites.select("vejkode/text()").extract()
        item['RoadName'] = sites.select("vejnavn/text()").extract()
        # read the zip code back from the request that produced this response
        item['PostNum'] = response.meta['zip_code']
        yield item
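Whatever you put in a Request's meta dict travels with that request and is handed back untouched as response.meta in the callback, so every response still knows which zip code its URL was built from. If the round trip is not obvious, here is the mechanism in isolation, a minimal sketch with a made-up spider name and URL that have nothing to do with your spider:

from scrapy.http import Request
from scrapy.spider import BaseSpider

class MetaDemo(BaseSpider):
    name = 'metademo'

    def start_requests(self):
        # anything stored in meta is carried along with the request ...
        yield Request(url='http://example.com/', meta={'tag': 'hello'})

    def parse(self, response):
        # ... and comes back in the callback as response.meta
        self.log(response.meta['tag'])

With the real spider above, your original command, scrapy crawl Geospider -o Scraped_data.csv -t csv, will then export RoadNum, RoadName and PostNum together.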
Hope that helps.