I have the following pipeline: a 'collect' command (collect_positions.py) -> a Celery task (tasks.py) -> a Scrapy spider (MySpider)...
collect_positions.py:
from django.core.management.base import BaseCommand

from tracker.models import Keyword
from tracker.tasks import positions


class Command(BaseCommand):
    help = 'collect_positions'

    def handle(self, *args, **options):
        def chunks(l, n):
            """Yield successive n-sized chunks from l."""
            for i in range(0, len(l), n):
                yield l[i:i + n]

        chunk_size = 1
        keywords = Keyword.objects.filter(product=product).values_list('id', flat=True)
        chunks_list = list(chunks(keywords, chunk_size))
        positions.chunks(chunks_list, 1).apply_async(queue='collect_positions')
        return 0
tasks.py:
from app_name.celery import app
from scrapy.settings import Settings
from scrapy_app import settings as scrapy_settings
from scrapy_app.spiders.my_spider import MySpider
from tracker.models import Keyword
from scrapy.crawler import CrawlerProcess


@app.task
def positions(*args):
    s = Settings()
    s.setmodule(scrapy_settings)
    keywords = Keyword.objects.filter(id__in=list(args))
    process = CrawlerProcess(s)
    process.crawl(MySpider, keywords_chunk=keywords)
    process.start()
    return 1
I run the command from the command line, and it creates the crawl tasks. The first queued task completes successfully, but the subsequent ones fail with this error:
twisted.internet.error.ReactorNotRestartable
Please tell me how to fix this error. If needed, I can provide any additional details...
Update 1
Thanks for your answer, @Cherief! I managed to run all the queues, but only the start_requests() method is called; parse() never runs.
The main methods of the spider:
def start_requests(self):
    print('STEP1')
    yield scrapy.Request(
        url='https://example.com',
        callback=self.parse,
        errback=self.error_callback,
        dont_filter=True
    )
def error_callback(self, failure):
    print(failure)
    # log all errback failures;
    # in case you want to do something special for some errors,
    # you may need the failure's type
    print(repr(failure))

    # if isinstance(failure.value, HttpError):
    if failure.check(HttpError):
        # you can get the response
        response = failure.value.response
        print('HttpError on %s', response.url)

    # elif isinstance(failure.value, DNSLookupError):
    elif failure.check(DNSLookupError):
        # this is the original request
        request = failure.request
        print('DNSLookupError on %s', request.url)

    # elif isinstance(failure.value, TimeoutError):
    elif failure.check(TimeoutError):
        request = failure.request
        print('TimeoutError on %s', request.url)

def parse(self, response):
    print('STEP2', response)
In the console I only get:
STEP1
What could be the reason?
Answer 0 (score: 0):
This is an age-old problem.
This is what helped me win the battle against the ReactorNotRestartable error: last answer from the author of the question
0) pip install crochet
1) add from crochet import setup
2) call setup() - at the top of the file
3) remove 2 lines:
   a) d.addBoth(lambda _: reactor.stop())
   b) reactor.run()
I had the same problem with this error, spent 4+ hours solving it, and read all the questions here about it. I finally found that one - and I'm sharing it. This is how I solved it. The only meaningful lines taken from the Scrapy docs are the last two lines of my code:
# some more imports
from importlib import import_module

from crochet import setup
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

setup()


def run_spider(spiderName):
    module_name = "first_scrapy.spiders.{}".format(spiderName)
    scrapy_var = import_module(module_name)          # do some dynamic import of selected spider
    spiderObj = scrapy_var.mySpider()                # get mySpider-object from spider module
    crawler = CrawlerRunner(get_project_settings())  # from Scrapy docs
    crawler.crawl(spiderObj)                         # from Scrapy docs
This code lets me choose which spider to run by passing its name to the run_spider function, and after scraping finishes I can pick another spider and run it again. In your case, you need to create a separate function in a separate file that runs your spider, and call that function from your task. That's how I usually do it :)
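Applied to the setup from the question, a minimal sketch of such a separate runner module might look like the following. The module name run_crawl.py, the function name run_my_spider, and the use of crochet's wait_for decorator are my assumptions for illustration, not code from the question or the answer:

# run_crawl.py -- a sketch, assuming crochet is installed
from crochet import setup, wait_for
from scrapy.crawler import CrawlerRunner      # unlike CrawlerProcess, it does not manage the reactor
from scrapy.settings import Settings

from scrapy_app import settings as scrapy_settings
from scrapy_app.spiders.my_spider import MySpider

setup()  # let crochet install and manage the Twisted reactor exactly once per process


@wait_for(timeout=600)  # block the caller (the Celery task) until the crawl's Deferred fires
def run_my_spider(keywords):
    s = Settings()
    s.setmodule(scrapy_settings)
    runner = CrawlerRunner(s)
    # no reactor.run() / reactor.stop() here -- crochet owns the reactor
    return runner.crawl(MySpider, keywords_chunk=keywords)

The Celery task would then just call run_my_spider(keywords) instead of creating a CrawlerProcess and calling process.start().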
P.S. There is in fact no way to restart the Twisted reactor.
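For illustration, this limitation can be reproduced with Twisted alone (a standalone sketch, unrelated to the code above):

# a minimal reproduction of the underlying Twisted limitation
from twisted.internet import reactor

reactor.callLater(0, reactor.stop)  # schedule an immediate stop
reactor.run()                       # first run starts and then stops cleanly
reactor.run()                       # raises twisted.internet.error.ReactorNotRestartable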
Update 1
I am not sure you even need to override the start_requests() method. For me it usually works with just this code:
class mySpider(scrapy.Spider):
    name = "somname"
    allowed_domains = ["somesite.com"]
    start_urls = ["https://somesite.com"]

    def parse(self, response):
        pass

    def parse_dir_contents(self, response):  # for crawling additional links
        pass
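For context (this is my addition, not part of the original answer): when start_urls is set and start_requests() is not overridden, the version inherited from scrapy.Spider behaves roughly like the simplified sketch below, so parse() ends up as the default callback:

import scrapy


class mySpider(scrapy.Spider):
    name = "somname"
    start_urls = ["https://somesite.com"]

    # roughly what the inherited start_requests() does:
    # request every start URL and route the response to self.parse
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        self.logger.info("parsed %s", response.url)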