I am trying to run a Scrapy spider from Django. The problem is that a spider can only be crawled from the project's top-level directory (the one containing scrapy.cfg). How can this be done from somewhere else?
.../polls/management/commands/mycommand.py
from django.core.management.base import BaseCommand
from scrapy.cmdline import execute
import os

class Command(BaseCommand):

    def run_from_argv(self, argv):
        print('In run_from_argv')
        self._argv = argv
        return self.execute()

    def handle(self, *args, **options):
        # os.environ['SCRAPY_SETTINGS_MODULE'] = '/home/nabin/scraptut/newscrawler'
        execute(self._argv[1:])
If I try

python manage.py mycommand crawl myspider

it fails, because the crawl command only works from the top-level directory containing scrapy.cfg. So how can this be done?
Answer 0 (score: 1)
You don't need to change the working directory unless you want to use the .cfg file, which can hold default options for the deploy command.

In your first approach, you forgot to add the crawler's path to the Python path and to set the Scrapy settings module correctly:
# file: myapp/management/commands/bot.py
import os
import sys

from django.core.management.base import BaseCommand
from scrapy import cmdline

class Command(BaseCommand):
    help = "Run scrapy"

    def handle(self, *args, **options):
        # Make the Scrapy project importable and tell Scrapy which settings to use.
        sys.path.insert(0, '/home/user/mybot')
        os.environ['SCRAPY_SETTINGS_MODULE'] = 'mybot.settings'
        # execute() expects an argv-style list; args holds the actual command arguments.
        cmdline.execute(['bot'] + list(args))
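Since `cmdline.execute()` parses its argument list the same way the `scrapy` CLI parses `sys.argv`, the management command only has to prepare the environment and forward its extra arguments. A minimal sketch of that preparation step, with a hypothetical helper name (`build_scrapy_argv`) and example paths that are assumptions, not part of the original answer:

```python
import os
import sys

def build_scrapy_argv(project_dir, settings_module, extra_args):
    """Prepare the environment and return an argv-style list suitable for
    scrapy.cmdline.execute(). The first element plays the role of the
    program name, like sys.argv[0]."""
    # Make the Scrapy project importable and point Scrapy at its settings.
    sys.path.insert(0, project_dir)
    os.environ['SCRAPY_SETTINGS_MODULE'] = settings_module
    return ['scrapy'] + list(extra_args)

# Hypothetical example values mirroring the answer above:
argv = build_scrapy_argv('/home/user/mybot', 'mybot.settings',
                         ['crawl', 'myspider'])
print(argv)  # ['scrapy', 'crawl', 'myspider']
```

Inside the real management command you would then pass `argv` to `scrapy.cmdline.execute()`; no change of working directory is needed because Scrapy locates the project through `SCRAPY_SETTINGS_MODULE` rather than `scrapy.cfg`.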
Answer 1 (score: 0)
OK, I found the solution myself.

In settings.py I defined:
CRAWLER_PATH = os.path.join(os.path.dirname(BASE_DIR), 'required path')
and then did the following:
from django.conf import settings
os.chdir(settings.CRAWLER_PATH)
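One caveat with this approach: `os.chdir()` changes the working directory for the whole process, which can surprise other code running afterwards. A safer sketch wraps the change in a context manager that restores the old directory on exit (the helper name `working_directory` is hypothetical; Python 3.11+ ships `contextlib.chdir` with the same behaviour):

```python
import os
from contextlib import contextmanager

@contextmanager
def working_directory(path):
    """Temporarily chdir into `path`, restoring the previous cwd on exit,
    even if an exception is raised inside the block."""
    old_cwd = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(old_cwd)

# Usage sketch (assumes the Django settings from the answer above):
# with working_directory(settings.CRAWLER_PATH):
#     execute(['scrapy', 'crawl', 'myspider'])
```

This keeps the chdir trick from the answer but limits its effect to the crawl itself.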