Django Celery Scrapy error: twisted.internet.error.ReactorNotRestartable

Date: 2018-05-02 18:02:21

Tags: python django scrapy celery

I have the following setup: a management command 'collect' (collect_positions.py) -> a Celery task (tasks.py) -> a Scrapy spider (MySpider)...

collect_positions.py:

from django.core.management.base import BaseCommand

from tracker.models import Keyword
from tracker.tasks import positions


class Command(BaseCommand):
    help = 'collect_positions'

    def handle(self, *args, **options):

        def chunks(l, n):
            """Yield successive n-sized chunks from l."""
            for i in range(0, len(l), n):
                yield l[i:i + n]

        chunk_size = 1

        # NOTE: `product` comes from code that is not shown in the question
        keywords = Keyword.objects.filter(product=product).values_list('id', flat=True)

        chunks_list = list(chunks(keywords, chunk_size))
        positions.chunks(chunks_list, 1).apply_async(queue='collect_positions')

        return 0
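
Here positions.chunks(chunks_list, 1) splits the argument lists into groups of one call each and dispatches them to the 'collect_positions' queue. A rough, hand-written equivalent of that call (an illustration only, not part of the original code) would be:

# roughly equivalent manual dispatch: one task per chunk of keyword ids
for chunk in chunks_list:
    positions.apply_async(args=chunk, queue='collect_positions')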

tasks.py:

from app_name.celery import app
from scrapy.settings import Settings
from scrapy_app import settings as scrapy_settings
from scrapy_app.spiders.my_spider import MySpider
from tracker.models import Keyword
from scrapy.crawler import CrawlerProcess


@app.task
def positions(*args):
    s = Settings()
    s.setmodule(scrapy_settings)

    keywords = Keyword.objects.filter(id__in=list(args))
    # a new CrawlerProcess (and Twisted reactor) is started on every task run; the reactor
    # cannot be restarted within the same worker process, which causes the error below
    process = CrawlerProcess(s)
    process.crawl(MySpider, keywords_chunk=keywords)
    process.start()

    return 1

I run the command from the command line, and it creates the parsing tasks. The first task completes successfully, but the subsequent ones return an error:

twisted.internet.error.ReactorNotRestartable

How can I fix this error? I can provide any additional details if needed...

Update 1

Thank you for your answer, @Cherief! I managed to get all the tasks to run, but only the start_requests() method is executed, and parse() never runs.

The main methods of the spider:

# imports needed for the errback below (per the Scrapy docs errback example):
#   from scrapy.spidermiddlewares.httperror import HttpError
#   from twisted.internet.error import DNSLookupError, TimeoutError

def start_requests(self):
    print('STEP1')

    yield scrapy.Request(
        url='https://example.com',
        callback=self.parse,
        errback=self.error_callback,
        dont_filter=True
    )

def error_callback(self, failure):
    print(failure)

    # log all errback failures,
    # in case you want to do something special for some errors,
    # you may need the failure's type
    print(repr(failure))

    # if isinstance(failure.value, HttpError):
    if failure.check(HttpError):
        # you can get the response
        response = failure.value.response
        print('HttpError on %s' % response.url)

    # elif isinstance(failure.value, DNSLookupError):
    elif failure.check(DNSLookupError):
        # this is the original request
        request = failure.request
        print('DNSLookupError on %s' % request.url)

    # elif isinstance(failure.value, TimeoutError):
    elif failure.check(TimeoutError):
        request = failure.request
        print('TimeoutError on %s' % request.url)


def parse(self, response):
    print('STEP2', response)

In the console I only get:

STEP1

What could be the reason?

1 Answer:

Answer 0 (score: 0):

This is an old problem, as old as the world itself:

This is what helped me win the battle against the ReactorNotRestartable error: the last answer from the author of the question
0) pip install crochet
1) import: from crochet import setup
2) setup() - at the top of the file
3) remove 2 lines (see the sketch after this list):
a) d.addBoth(lambda _: reactor.stop())
b) reactor.run()
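
Put together, the change looks roughly like this. The sketch below starts from the CrawlerRunner example in the Scrapy docs and applies the steps above; MySpider stands in for your own spider class, and the two removed lines are kept as comments for reference:

from crochet import setup
setup()  # at the top of the file, before any Twisted/Scrapy machinery is touched

from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings


def run_my_spider():
    runner = CrawlerRunner(get_project_settings())
    runner.crawl(MySpider)                  # returns a Deferred
    # d.addBoth(lambda _: reactor.stop())   # removed - crochet manages the reactor
    # reactor.run()                         # removed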

I had the same problem with this error, spent more than 4 hours on it, and read every related question here. I finally found a solution - and I'm sharing it. This is how I solved the problem. The only meaningful lines from the Scrapy docs are the last 2 lines of my code:

# some more imports - everything the function below relies on:
from importlib import import_module
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

from crochet import setup
setup()

def run_spider(spiderName):
    module_name="first_scrapy.spiders.{}".format(spiderName)
    scrapy_var = import_module(module_name)   #do some dynamic import of selected spider   
    spiderObj=scrapy_var.mySpider()           #get mySpider-object from spider module
    crawler = CrawlerRunner(get_project_settings())   #from Scrapy docs
    crawler.crawl(spiderObj)                          #from Scrapy docs

This code lets me choose which spider to run by passing its name to the run_spider function, and once the scraping finishes, pick another spider and run it again.
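
For example (spider_one and spider_two are hypothetical spider names, used only for illustration):

run_spider('spider_one')   # first crawl
run_spider('spider_two')   # second crawl from the same process, no reactor restart needed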

In your case, you need to create a separate function in a separate file that runs your spider, and call it from your task. That is how I usually do it :)
P.S. There is actually no way to restart the Twisted reactor.
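
Applied to the task from the question, the suggested shape looks roughly like this (a sketch only, assuming crochet is installed; the crawling code is inlined in the task here just to keep the example short, while the answer recommends putting it in a separate module):

# tasks.py - sketch, not a verified drop-in replacement
from crochet import setup
setup()  # install crochet's reactor management at the top of the file

from app_name.celery import app
from scrapy.crawler import CrawlerRunner
from scrapy.settings import Settings
from scrapy_app import settings as scrapy_settings
from scrapy_app.spiders.my_spider import MySpider
from tracker.models import Keyword


@app.task
def positions(*args):
    s = Settings()
    s.setmodule(scrapy_settings)

    keywords = Keyword.objects.filter(id__in=list(args))
    runner = CrawlerRunner(s)                 # CrawlerRunner instead of CrawlerProcess
    runner.crawl(MySpider, keywords_chunk=keywords)
    # no process.start() / reactor.run(): crochet keeps the reactor alive between tasks
    return 1
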
Update 1
I don't know whether you need to call the start_requests() method at all. For me it usually works with just this code:

class mySpider(scrapy.Spider):
    name = "somname"
    allowed_domains = ["somesite.com"]
    start_urls = ["https://somesite.com"]

    def parse(self, response):
        pass

    def parse_dir_contents(self, response):      # for crawling additional links
        pass