Internal server error when running Scrapy with Django

Date: 2019-07-27 09:04:23

Tags: django python-3.x scrapy

I'm building a relatively simple django app where you can add articles. Besides adding articles manually, I'm trying to use scrapy: after an article title has been added manually, you open the article page from the frontend and, at the press of a button, scrape another website to find that article and copy specific links it finds.

The problem is that I'm facing several issues that cause an Internal Server Error. When I run the django server without the --noreload --nothreading options, the error is ValueError: signal only works in main thread. When running the server with those options, I get raise error.ReactorNotRestartable() followed by twisted.internet.error.ReactorNotRestartable.

I'd rather not run the django server with --noreload --nothreading in general, because I've read that it severely limits performance.

I made a separate app inside the django project to keep things tidy. The views.py I'm using:

from django.shortcuts import render, get_object_or_404, redirect
from .models import *
import scrapy
from scrapy.crawler import CrawlerProcess


def FetchLinks(request, pk, slug):
    mas = []
    fetched = []
    sites = []

    # this gets the sites where to search the article
    s = Sites.objects.all()
    for site in s:
        sites.append(site.url)  # start_urls must be URL strings, not model instances

    if request.method == 'POST':
        article = get_object_or_404(Article, id=pk)
        sites = SearchLinks(article.title)

        # this gets some expressions to find about the article
        sr = Srm.objects.all()
        for srm in sr:
            mas.append(srm.name)  # compare against strings, not model instances

        class MySpider(scrapy.Spider):
            name = 'Spider'
            start_urls = sites

            def start_requests(self):
                urls = sites
                for url in urls:
                    yield scrapy.Request(url=url, callback=self.parse)

            def parse(self, response):
                for href in response.css("a::attr(href)").extract():
                    for ms in mas:
                        if ms in href:
                            fetched.append(href)

        subprocess = CrawlerProcess(settings={
            'FEED_FORMAT': 'json',
            'FEED_URI': 'items.json',
            'LOG_LEVEL': 'WARNING',
        })

        subprocess.crawl(MySpider)
        subprocess.start()

        sobj, created = Links.objects.get_or_create(article=article)
        sobj.set_urls(fetched)
        sobj.save()

        return redirect('detail', pk=pk, slug=slug)
    else:
        return render(request, 'content.html')

My models.py:

class Sites(models.Model):
    name = models.CharField(max_length=100)
    url = models.URLField()

    def __str__(self):
        return self.name


class Srm(models.Model):
    name = models.CharField(max_length=100)
    url = models.URLField()

    def __str__(self):
        return self.name


class Links(models.Model):
    urls = models.CharField(max_length=800)
    last_updated = models.DateTimeField(auto_now=True)
    article = models.OneToOneField(Article, on_delete=models.CASCADE)

    def __str__(self):
        return str(self.article)

On the frontend there is just a button that makes a POST request.

The following comes from running python3 manage.py runserver 8000 --noreload --nothreading:

Internal Server Error: /library/fetch/30/an-awesome-article/ 
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.7/dist-packages/django/core/handlers/base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/usr/local/lib/python3.7/dist-packages/django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/home/er/Desktop/projects/proj/projspider/views.py", line 59, in FetchLinks
    subprocess.start()
  File "/home/er/.local/lib/python3.7/site-packages/scrapy/crawler.py", line 309, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/home/er/.local/lib/python3.7/site-packages/twisted/internet/base.py", line 1271, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/home/er/.local/lib/python3.7/site-packages/twisted/internet/base.py", line 1251, in startRunning
    ReactorBase.startRunning(self)
  File "/home/er/.local/lib/python3.7/site-packages/twisted/internet/base.py", line 754, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
2019-07-27 10:14:48 [django.request] ERROR: Internal Server Error: /library/fetch/30/an-awesome-article/ 
[27/Jul/2019 10:14:48] "POST /library/fetch/30/an-awesome-article/  HTTP/1.1" 500 89125

When I run python3 manage.py runserver 8000:

Internal Server Error: /library/fetch/30/an-awesome-article/ 
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.7/dist-packages/django/core/handlers/base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/usr/local/lib/python3.7/dist-packages/django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/home/user/Desktop/projects/proj/projspider/views.py", line 55, in FetchLinks
    'LOG_LEVEL': 'WARNING',
  File "/home/er/.local/lib/python3.7/site-packages/scrapy/crawler.py", line 268, in __init__
    install_shutdown_handlers(self._signal_shutdown)
  File "/home/er/.local/lib/python3.7/site-packages/scrapy/utils/ossignal.py", line 22, in install_shutdown_handlers
    reactor._handleSignals()
  File "/home/er/.local/lib/python3.7/site-packages/twisted/internet/posixbase.py", line 295, in _handleSignals
    _SignalReactorMixin._handleSignals(self)
  File "/home/er/.local/lib/python3.7/site-packages/twisted/internet/base.py", line 1232, in _handleSignals
    signal.signal(signal.SIGINT, self.sigInt)
  File "/usr/lib/python3.7/signal.py", line 47, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
ValueError: signal only works in main thread
[27/Jul/2019 09:23:04] "POST /library/fetch/30/an-awesome-article/  HTTP/1.1" 500 93336

Is there some standardized way to integrate scrapy with django, instead of the approach above?

I've seen some tutorials online, such as this one, but it doesn't make sense to me how to trigger scrapy when the user requests it, and I don't understand how the scrapy project could read the Django models so that I can configure the spider based on information pulled from them.

0 Answers:

There are no answers yet.