I have a spiders.py in my Scrapy project that contains the following spiders...
import scrapy

class OneSpider(scrapy.Spider):
    name = "s1"

    def start_requests(self):
        urls = ["http://url1.com"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Scrape stuff, put it in a dict
        yield dictOfScrapedStuff
class TwoSpider(scrapy.Spider):
    name = "s2"

    def start_requests(self):
        urls = ["http://url2.com"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Scrape stuff, put it in a dict
        yield dictOfScrapedStuff
How do I run spiders s1 and s2 and write their scraped results to s1.json and s2.json?
Answer (score: 1)
The scrapy crawl command runs a single spider per process, so the simplest approach is just to run two processes:
scrapy crawl s1 -o s1.json
scrapy crawl s2 -o s2.json
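If they don't need to run in parallel, you can also chain the two crawls on one line (a minimal sketch; && simply runs the second crawl after the first finishes successfully):

scrapy crawl s1 -o s1.json && scrapy crawl s2 -o s2.json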
If you want to do this from the same terminal window, you have to either:
use nohup, for example:
nohup scrapy crawl s1 -o s1.json --logfile s1.log &
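The same pattern applies to the second spider; assuming you want both running at once (s2.log is just a matching log name to keep the output separate):

nohup scrapy crawl s2 -o s2.json --logfile s2.log &
wait  # optional: block until both background crawls finish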
or use the screen command:
$ screen
$ scrapy crawl s1 -o s1.json
Ctrl+a Ctrl+d   # detach from the screen session
$ screen
$ scrapy crawl s2 -o s2.json
Ctrl+a Ctrl+d   # detach
$ screen -r     # reattach to one of your sessions to see how the spider is doing
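With more than one detached session, screen -r alone won't know which to pick; screen's standard -S, -ls, and -r flags let you name and target sessions (the session names here are just examples):

$ screen -S crawl-s1      # start a named session for the first spider
$ screen -ls              # list running sessions
$ screen -r crawl-s1      # reattach to a specific session by name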
Personally I prefer the nohup or screen options because they are clean and don't clutter your terminal with logging and other output.