Run Scrapy spiders sequentially with different variables

Posted: 2017-03-13 19:08:07

Tags: python csv web-scraping scrapy scrapy-spider

I have a list of URL endings in a .csv file that I want to scrape, like this:

run

123

124

125

I want to run all of these through a single spider, in an ordered queue: run MySpider with 123, and once it finishes, run MySpider with 124, and so on.

Something like this:

from csv import DictReader
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
with open('run.csv') as rows:
    for row in DictReader(rows):
        process.crawl(MySpider, run=row['run'])
process.start()

But the crawls have to run one after another, and I need to pass the variable row['run'] from the .csv file into the spider so it can be used there.

Here is a sample of the spider:

import scrapy

class MySpider(scrapy.Spider):
    name = 'numbers'

    def __init__(self, run=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.run = int(run)
        self.start_urls = ['http://www.canada411.ca/res/%s/' % page
                           for page in xrange(self.run, self.run + 1000)]

    def parse(self, response):
        yield {
            'Number': self.run,
            'Name': SCRAPED_NAME,  # placeholder for the value scraped from the page
        }

process = CrawlerProcess()
with open('run.csv') as rows:
    for row in DictReader(rows):
        process.crawl(MySpider, run=row['run'])
process.start()

2 Answers:

Answer 0 (score: 0)

Here is an example, based on the pattern from the Scrapy documentation:

https://doc.scrapy.org/en/latest/topics/practices.html

from csv import DictReader

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    with open('run.csv') as rows:
        for row in DictReader(rows):
            # wait for each crawl to finish before starting the next one
            yield runner.crawl(MySpider, run=row['run'])
    reactor.stop()

crawl()
reactor.run()  # the script blocks here until the last crawl is finished
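Because crawl() is an inlineCallbacks generator, each yield suspends it until that crawl's Deferred fires, so the next row of the .csv is only started once the previous spider has finished. This gives the ordered queue asked for, and the run value is passed to the spider's constructor as a keyword argument, the same way scrapy crawl -a run=... would.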

Answer 1 (score: 0)

I achieved this by using run_spider() from the scrapydo package: https://pypi.python.org/pypi/scrapydo/0.2.2
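Since the answer only names the function, here is a minimal sketch of what that could look like. It assumes that scrapydo.run_spider() blocks until the crawl completes and that extra keyword arguments (run= here) are forwarded to the spider; MySpider is the spider from the question.

from csv import DictReader
import scrapydo

scrapydo.setup()  # has to be called once before running any spider

with open('run.csv') as rows:
    for row in DictReader(rows):
        # run_spider() is assumed to block until the crawl finishes, so the
        # rows are processed strictly in order; run= is assumed to be
        # forwarded to the spider's constructor
        items = scrapydo.run_spider(MySpider, run=row['run'])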