Hello, I would like to know how to pass the scraping results, stored in a pandas DataFrame, from the spider back to the module that creates it.
import mySpider as mspider
from scrapy.crawler import CrawlerProcess

def main():
    spider1 = mspider.MySpider()
    process = CrawlerProcess()
    process.crawl(spider1)
    process.start()
    print(len(spider1.result))  # prints 0, even though parse() collected rows
The spider:
import scrapy
import pandas as pd
from scrapy import Request

import config

class MySpider(scrapy.Spider):
    name = 'MySpider'
    allowed_domains = config.ALLOWED_DOMAINS
    result = pd.DataFrame(columns=...)

    def start_requests(self):
        yield Request(url=..., headers=config.HEADERS, callback=self.parse)

    def parse(self, response):
        # ... some code that appends values to self.result ...
        print("size: " + str(len(self.result)))
The print inside parse() shows a size of 1005, but the print in main() shows 0. Can you tell me how to pass the value between the two?
I want to do this because I am running multiple spiders. After they finish crawling, I will merge their results and save them to a file.
Solution:
from datetime import datetime

from scrapy import signals
from scrapy.crawler import CrawlerProcess

import mySpider as mspider

def spider_closed(spider, reason):
    # runs after the crawl has finished, so spider.result is fully populated
    print("Size: " + str(len(spider.result)))

def main():
    now = datetime.now()
    crawler_process = CrawlerProcess()
    # pass the spider class (not an instance) and reuse the returned crawler
    crawler = crawler_process.create_crawler(mspider.MySpider)
    crawler.signals.connect(spider_closed, signals.spider_closed)
    crawler_process.crawl(crawler)
    crawler_process.start()
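For the multiple-spider case this is meant for, the same signal can collect each spider's DataFrame into a shared list, which is concatenated and saved once the process has finished. This is only a minimal sketch under the assumptions of this thread: MySpiderA and MySpiderB are placeholders for the actual spider classes, each of them exposes a result DataFrame, and output.csv is a placeholder filename.

import pandas as pd
from scrapy import signals
from scrapy.crawler import CrawlerProcess

import mySpider as mspider

collected = []  # DataFrames gathered from every finished spider

def spider_closed(spider, reason):
    # called once per spider after its crawl ends; spider.result is complete here
    collected.append(spider.result)

def main():
    crawler_process = CrawlerProcess()
    # MySpiderA / MySpiderB are hypothetical names for your spider classes
    for spider_cls in (mspider.MySpiderA, mspider.MySpiderB):
        crawler = crawler_process.create_crawler(spider_cls)
        crawler.signals.connect(spider_closed, signals.spider_closed)
        crawler_process.crawl(crawler)
    crawler_process.start()  # blocks until all crawls are done
    merged = pd.concat(collected, ignore_index=True)
    merged.to_csv("output.csv", index=False)  # placeholder filename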
Answer (score: 1):
The main reason for this behavior is the asynchronous nature of Scrapy itself. The line print(len(spider1.result)) is executed before the .parse() method has been called.
There are several ways to wait for the spider to finish. I would use the spider_closed signal:
from scrapy import signals
from scrapy.crawler import CrawlerProcess

import mySpider as mspider

def spider_closed(spider, reason):
    print(len(spider.result))

crawler_process = CrawlerProcess(settings)  # settings: your project settings, defined elsewhere
# create_crawler() needs the spider class; the returned crawler exposes .signals
crawler = crawler_process.create_crawler(mspider.MySpider)
crawler.signals.connect(spider_closed, signals.spider_closed)
crawler_process.crawl(crawler)
crawler_process.start()
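Either way, the key point is the same: connect to spider_closed before calling start(), and read spider.result inside that callback (or after start() returns), because the crawl only runs once the reactor has started. One further note on the spider itself: result is declared as a class attribute on MySpider, so it is shared by every instance of that class; if each run should start with an empty DataFrame, it may be safer to create it in __init__ instead.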