I am making a relatively simple Django app in which you can add articles. Besides adding articles manually, I am trying to use scrapy so that, after you have added an article's title by hand, you visit the article page from the frontend and, with the press of a button, crawl another website to find that article and copy the specific links it finds there.
The problem is that I am running into a couple of issues, each of which ends in an Internal Server Error. When I run the Django server without the options --noreload --nothreading, the error is ValueError: signal only works in main thread. When I run the server with those options, I instead get twisted.internet.error.ReactorNotRestartable.
I would rather not run the Django server with --noreload --nothreading as a matter of course, because I have read that it limits performance considerably.
To keep things tidy I created a separate app inside the Django project for this. The views.py I am using is:
from django.shortcuts import render, get_object_or_404
from django.urls import reverse_lazy
from templates import *
from .models import *

import scrapy
from scrapy.crawler import CrawlerProcess


def FetchLinks(request, pk, slug):
    mas = []
    fetched = []
    sites = []

    # the sites on which to look for the article
    s = Sites.objects.all()
    for site in s:
        sites.append(site)

    if request.method == 'POST':
        article = get_object_or_404(Article, id=pk)
        # SearchLinks (helper defined elsewhere) returns the URLs to crawl for this article
        sites = SearchLinks(article.title)

        # expressions used to recognise links that belong to the article
        sr = Srm.objects.all()
        for srm in sr:
            mas.append(srm)

        class MySpider(scrapy.Spider):
            name = 'Spider'
            start_urls = sites

            def start_requests(self):
                urls = sites
                for url in urls:
                    yield scrapy.Request(url=url, callback=self.parse)

            def parse(self, response):
                for href in response.css("a::attr(href)").extract():
                    for ms in mas:
                        if ms in href:
                            fetched.append(href)

        subprocess = CrawlerProcess(settings={
            'FEED_FORMAT': 'json',
            'FEED_URI': 'items.json',
            'LOG_LEVEL': 'WARNING',
        })
        subprocess.crawl(MySpider)
        subprocess.start()

        sobj, created = Links.objects.get_or_create(article=article)
        sobj.set_urls(fetched)
        sobj.save()
        return reverse_lazy('detail', pk=pk, slug=slug)
    else:
        return render(request, 'content.html')
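For reference, the view is wired up in the app's urls.py roughly like this (the app itself is included under the library/ prefix in the project urls.py; the URL name is just what I happened to call it):

from django.urls import path
from . import views

urlpatterns = [
    # gives URLs of the form /library/fetch/30/an-awesome-article/
    path('fetch/<int:pk>/<slug:slug>/', views.FetchLinks, name='fetch-links'),
]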
My models.py:
class Sites(models.Model):
    name = models.CharField(max_length=100)
    url = models.URLField()

    def __str__(self):
        return self.name


class Srm(models.Model):
    name = models.CharField(max_length=100)
    url = models.URLField()

    def __str__(self):
        return self.name


class Links(models.Model):
    urls = models.CharField(max_length=800)
    last_updated = models.DateTimeField(auto_now=True)
    article = models.OneToOneField(Article, on_delete=models.CASCADE)

    def __str__(self):
        return self.article.title
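(set_urls, which the view calls, is just a small convenience method on Links; it does little more than pack the list of found links into the urls field, roughly along these lines:)

import json

class Links(models.Model):
    # ...fields as above...

    def set_urls(self, urls):
        # store the list of found links in the single `urls` CharField
        self.urls = json.dumps(urls)

    def get_urls(self):
        return json.loads(self.urls) if self.urls else []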
On the frontend there is just a button that makes a POST request to this view.
The following is what I get when running python3 manage.py runserver 8000 --noreload --nothreading:
Internal Server Error: /library/fetch/30/an-awesome-article/
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.7/dist-packages/django/core/handlers/base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/usr/local/lib/python3.7/dist-packages/django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/home/er/Desktop/projects/proj/projspider/views.py", line 59, in FetchLinks
    subprocess.start()
  File "/home/er/.local/lib/python3.7/site-packages/scrapy/crawler.py", line 309, in start
    reactor.run(installSignalHandlers=False) # blocking call
  File "/home/er/.local/lib/python3.7/site-packages/twisted/internet/base.py", line 1271, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/home/er/.local/lib/python3.7/site-packages/twisted/internet/base.py", line 1251, in startRunning
    ReactorBase.startRunning(self)
  File "/home/er/.local/lib/python3.7/site-packages/twisted/internet/base.py", line 754, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
2019-07-27 10:14:48 [django.request] ERROR: Internal Server Error: /library/fetch/30/an-awesome-article/
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.7/dist-packages/django/core/handlers/base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/usr/local/lib/python3.7/dist-packages/django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/home/er/Desktop/projects/proj/projspider/views.py", line 59, in FetchLinks
    subprocess.start()
  File "/home/er/.local/lib/python3.7/site-packages/scrapy/crawler.py", line 309, in start
    reactor.run(installSignalHandlers=False) # blocking call
  File "/home/er/.local/lib/python3.7/site-packages/twisted/internet/base.py", line 1271, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/home/er/.local/lib/python3.7/site-packages/twisted/internet/base.py", line 1251, in startRunning
    ReactorBase.startRunning(self)
  File "/home/er/.local/lib/python3.7/site-packages/twisted/internet/base.py", line 754, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
[27/Jul/2019 10:14:48] "POST /library/fetch/30/an-awesome-article/ HTTP/1.1" 500 89125
And this is what I get when I run python3 manage.py runserver 8000:
Internal Server Error: /library/fetch/30/an-awesome-article/
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.7/dist-packages/django/core/handlers/base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/usr/local/lib/python3.7/dist-packages/django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/home/user/Desktop/projects/proj/projspider/views.py", line 55, in FetchLinks
    'LOG_LEVEL': 'WARNING',
  File "/home/er/.local/lib/python3.7/site-packages/scrapy/crawler.py", line 268, in __init__
    install_shutdown_handlers(self._signal_shutdown)
  File "/home/er/.local/lib/python3.7/site-packages/scrapy/utils/ossignal.py", line 22, in install_shutdown_handlers
    reactor._handleSignals()
  File "/home/er/.local/lib/python3.7/site-packages/twisted/internet/posixbase.py", line 295, in _handleSignals
    _SignalReactorMixin._handleSignals(self)
  File "/home/er/.local/lib/python3.7/site-packages/twisted/internet/base.py", line 1232, in _handleSignals
    signal.signal(signal.SIGINT, self.sigInt)
  File "/usr/lib/python3.7/signal.py", line 47, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
ValueError: signal only works in main thread
[27/Jul/2019 09:23:04] "POST /library/fetch/30/an-awesome-article/ HTTP/1.1" 500 93336
Is there some standardized way of integrating scrapy with Django instead of the approach above?
I have seen a few tutorials online, such as this one, but none of them make it clear how scrapy is supposed to be triggered when the user asks for it, and I do not understand how the scrapy project is meant to read the Django models, so that I can configure the spider from the information stored in them.
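The only workaround I can come up with myself is to push the crawl into a completely separate process, so that the Twisted reactor starts and stops together with that process and never runs inside a Django worker thread. A rough sketch of what I mean (run_spider and the queue handling are my own names, not taken from any tutorial):

from multiprocessing import Process, Queue

import scrapy
from scrapy.crawler import CrawlerProcess


def run_spider(start_urls, patterns, results):
    # runs in the child process, so CrawlerProcess can install its signal
    # handlers and start the reactor without touching the Django threads
    fetched = []

    class MySpider(scrapy.Spider):
        name = 'Spider'

        def start_requests(self):
            for url in start_urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            for href in response.css("a::attr(href)").extract():
                if any(p in href for p in patterns):
                    fetched.append(href)

    process = CrawlerProcess(settings={'LOG_LEVEL': 'WARNING'})
    process.crawl(MySpider)
    process.start()  # blocks until the crawl is finished
    results.put(fetched)


# in the view, instead of starting CrawlerProcess directly (assuming the
# expressions to match are the Srm names, passed as plain strings):
# results = Queue()
# worker = Process(target=run_spider,
#                  args=(sites, [srm.name for srm in sr], results))
# worker.start()
# fetched = results.get()
# worker.join()

Would something like that be considered acceptable, or is there a cleaner, more standard way to trigger scrapy from a Django view and let the spider use the Django models?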