Using the scrapy "crawl" command from Django

Date: 2014-02-03 09:07:52

Tags: python django scrapy

I am trying to run a (Scrapy) spider from Django. The problem is that a spider can only be crawled when we are in the top-level directory (the one containing scrapy.cfg). How can this be done?

.../polls/management/commands/mycommand.py

from django.core.management.base import BaseCommand
from scrapy.cmdline import execute
import os

class Command(BaseCommand):

    def run_from_argv(self, argv):
        print ('In run_from_argv')
        self._argv = argv
        return self.execute()

    def handle(self, *args, **options):
        #os.environ['SCRAPY_SETTINGS_MODULE'] = '/home/nabin/scraptut/newscrawler'
        execute(self._argv[1:])

If I try

python manage.py mycommand crawl myspider
then it does not work, because to use crawl I need to be in the top-level directory that contains the scrapy.cfg file. So I would like to know how this can be done.
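For context, Scrapy decides whether it is inside a project by walking upward from the current working directory looking for scrapy.cfg, which is why the command only works from the project root. A simplified sketch of that lookup (my own illustration, not Scrapy's actual code):

```python
import os

def find_scrapy_cfg(start_dir):
    """Walk upward from start_dir looking for scrapy.cfg, mimicking how
    Scrapy decides whether it is running inside a project directory."""
    path = os.path.abspath(start_dir)
    while True:
        candidate = os.path.join(path, 'scrapy.cfg')
        if os.path.exists(candidate):
            return candidate
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root without finding it
            return None
        path = parent
```

If this lookup fails, subcommands like `crawl` refuse to run, which is exactly the behavior described in the question.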

2 Answers:

Answer 0 (score: 1)

You don't need to change the working directory, unless you want to use the .cfg file, which can contain default options for the deploy command.

In your first approach, you forgot to add the crawler path to the Python path and to set the scrapy settings module correctly:

# file: myapp/management/commands/bot.py
import os
import sys

from django.core.management.base import BaseCommand
from scrapy import cmdline


class Command(BaseCommand):
    help = "Run scrapy"

    def handle(self, *args, **options):
        sys.path.insert(0, '/home/user/mybot')
        os.environ['SCRAPY_SETTINGS_MODULE'] = 'mybot.settings'
        # cmdline.execute expects a full argv list: the program name
        # first, then the scrapy subcommand and its arguments.
        cmdline.execute(['bot'] + list(args))

Answer 1 (score: 0)

Okay, I found the solution myself.

In settings.py I defined:

CRAWLER_PATH = os.path.join(os.path.dirname(BASE_DIR), 'required path')

and then did the following:

import os

from django.conf import settings

# Switch into the directory containing scrapy.cfg before crawling.
os.chdir(settings.CRAWLER_PATH)
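One caveat with this approach is that `os.chdir` changes the working directory for the whole process. A minimal sketch (stdlib only, my own suggestion rather than part of the answer) of restoring the previous directory afterwards with a context manager:

```python
import os
from contextlib import contextmanager

@contextmanager
def working_directory(path):
    """Temporarily change the working directory, restoring it on exit
    even if the body raises an exception."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)

# Hypothetical usage inside the management command:
# with working_directory(settings.CRAWLER_PATH):
#     cmdline.execute(['scrapy', 'crawl', 'myspider'])
```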