Following the documentation, I made a very simple attempt to run spiders from a single file with CrawlerProcess. Here is the code:
import scrapy
from scrapy.crawler import CrawlerProcess

class BaseSpider(scrapy.Spider):
    def common_parse(self, response):
        yield {
            'test': response.css("title::text").extract()
        }

class MonoprixSpider(BaseSpider):
    # Your first spider definition
    name = "monoprix_bot"
    start_url = ['https://www.monoprix.fr/courses-en-ligne']

    def parse(self, response):
        self.common_parse(response)

class EbaySpider(BaseSpider):
    # Your second spider definition
    name = "ebay_bot"
    start_url = ['https://www.ebay.fr/']

    def parse(self, response):
        self.common_parse(response)

process = CrawlerProcess()
process.crawl(MonoprixSpider)
process.crawl(EbaySpider)
process.start()  # the script will block here until all crawling jobs are finished
Both spiders open and close without yielding the page title (which I use as a test). I previously ran more complex Ebay and Monoprix spiders in two separate projects, and those worked fine...
Am I missing something obvious?
Answer 0 (score: 0)
Please change start_url to start_urls:
start_urls = ['https://www.monoprix.fr/courses-en-ligne']
Since there is no start_urls attribute, you are essentially seeding the spider with an empty list, so it has nothing to crawl.
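For reference, here is a minimal corrected sketch of the question's code. Besides renaming start_url to start_urls, note one additional detail not covered above: parse calls common_parse, which is a generator, so its items are discarded unless parse yields them back (e.g. with yield from). Both changes are needed before the titles appear:

import scrapy
from scrapy.crawler import CrawlerProcess

class BaseSpider(scrapy.Spider):
    def common_parse(self, response):
        # Generator: yields a single item containing the page title
        yield {
            'test': response.css("title::text").extract()
        }

class MonoprixSpider(BaseSpider):
    name = "monoprix_bot"
    start_urls = ['https://www.monoprix.fr/courses-en-ligne']  # was start_url

    def parse(self, response):
        # Re-yield the shared parser's items; without "yield from",
        # the generator returned by common_parse is silently discarded
        yield from self.common_parse(response)

class EbaySpider(BaseSpider):
    name = "ebay_bot"
    start_urls = ['https://www.ebay.fr/']  # was start_url

    def parse(self, response):
        yield from self.common_parse(response)

process = CrawlerProcess()
process.crawl(MonoprixSpider)
process.crawl(EbaySpider)
process.start()  # blocks until both crawls finish

With both fixes applied, each spider requests its start URL and logs one scraped item with the page title.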