Running a Scrapy spider from a path

Time: 2016-09-28 18:14:15

Tags: python scrapy

Some of the advice on running Scrapy suggests doing the following in order to start it from a script, debug it in an IDE, and so on:

from scrapy import cmdline

cmdline.execute(("scrapy runspider spider-file-name.py").split())

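This works as long as the script is placed inside the project directory, but not when I try to give it an absolute or relative path. For example: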

import os

from scrapy import cmdline

this_file_path = os.path.dirname(os.path.realpath(__file__))
base_path = this_file_path.replace('bootstrap', '')
full_path = base_path + "path/to/spiders/some-spider.py"
print(full_path)

cmdline.execute(("scrapy runspider " + full_path).split())

With this, I get:

2016-09-28 10:49:29 [scrapy] INFO: Scrapy 1.1.2 started (bot: scrapybot)
2016-09-28 10:49:29 [scrapy] INFO: Overridden settings: {}
Usage
=====
  scrapy runspider [options] <spider_file>

spider-main.py: error: Unable to load '/Users/name/intellij-workspace/crawling/scrape/scrape/spiders/some-spider.py': No module named items

Is there a way to run and debug a Scrapy spider from an absolute path? Ideally, I need to debug it in an IDE.

1 answer:

Answer 0: (score: 2)

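I strongly recommend using distributed crawling software, but if you really want to do it this way just for some quick-and-dirty testing, then: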

import subprocess

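# Launch from the project root (cwd) so that Scrapy can locate the project and its modules, such as items.
project_path = "/Users/name/intellij-workspace/crawling/scrape"
subprocess.Popen(["scrapy", "runspider", "scrape/spiders/some-spider.py"], cwd=project_path)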
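If the goal is to hit breakpoints in an IDE, a subprocess can be awkward, because the spider then runs in a separate process from the one being debugged. Here is a minimal in-process sketch of the same idea. It assumes the project root is the directory that contains scrapy.cfg, and that the import error comes from launching Scrapy outside that directory. The helper names prepare_project_env and debug_spider are made up for illustration and are not part of Scrapy.

```python
import os
import sys


def prepare_project_env(project_path):
    """Make project_path the working directory and put it on sys.path.

    Assumption: project_path is the Scrapy project root (where scrapy.cfg lives).
    """
    project_path = os.path.abspath(project_path)
    # Mirrors the cwd=project_path argument used with subprocess.Popen.
    os.chdir(project_path)
    # Extra safety so that project modules can be imported by name.
    if project_path not in sys.path:
        sys.path.insert(0, project_path)
    return project_path


def debug_spider(project_path, spider_rel_path):
    """Run `scrapy runspider` in the current process so IDE breakpoints apply."""
    prepare_project_env(project_path)
    # Imported here so the helper above can be used without Scrapy installed.
    from scrapy import cmdline
    # A list is used instead of str.split() so paths with spaces stay intact.
    cmdline.execute(["scrapy", "runspider", spider_rel_path])
```

With the paths from the question, the call would be debug_spider("/Users/name/intellij-workspace/crawling/scrape", "scrape/spiders/some-spider.py"). Note that cmdline.execute finishes by calling sys.exit, so any code placed after that call in the same script will not run.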