I have just started integrating Django with Scrapy.
Upon receiving a variable (the website URL) on the Django side, I want to pass it to the Scrapy part so that the site can be crawled.
This is the code snippet I wrote on the backend:
def post(self, request, format=None):
    ...
    serializer = self.serializer_class(data=data)
    if serializer.is_valid():
        site = serializer.create(data)
        domain = urlparse(site.url).netloc
        site_id = site.id
        unique_id = str(uuid4())  # ties the Django record to the scrapyd job
        settings = {
            'unique_id': unique_id,
            'USER_AGENT': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
        }
        # schedule the 'icrawler' spider on scrapyd; url and domain are
        # forwarded to the spider as spider arguments
        task_id = scrapyd.schedule('default', 'icrawler', settings=settings,
                                   url=site.url, domain=domain)
        task = {
            'task_id': task_id,
            'unique_id': unique_id,
            'status': 'started'
        }
        resp = {
            'task': task,
            'data': serializer.data,
            'status': status.HTTP_201_CREATED
        }
        return Response(resp)
    else:
        return Response(serializer.errors, status=status.HTTP_400_BAD_REQUEST)
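The view needs a fresh job identifier and a hostname before it can call scrapyd.schedule. A minimal stdlib-only sketch of how those values could be derived (the helper name build_schedule_kwargs is hypothetical, not part of my project):

```python
from urllib.parse import urlparse
from uuid import uuid4

def build_schedule_kwargs(url):
    """Derive the arguments for scrapyd.schedule from a raw website URL."""
    unique_id = str(uuid4())          # random id linking the DB record to the crawl job
    domain = urlparse(url).netloc     # bare hostname, e.g. 'google.com'
    return {
        'settings': {'unique_id': unique_id},
        'url': url,
        'domain': domain,
    }
```

Everything except settings here ends up as keyword arguments to schedule, which scrapyd hands to the spider as spider arguments.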
I have created a spider called icrawler.py in the Django project:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class IcrawlerSpider(CrawlSpider):
    name = 'icrawler'
    allowed_domains = ['https://google.com']
    start_urls = ['http://https://google.com/']
    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        return i
As you can see, the spider hardcodes allowed_domains = ['https://google.com']
and start_urls = ['http://https://google.com/'].
I want to replace these hardcoded values with the variable passed from Django, and start the crawler as soon as the variable is received on the Django side.
I am not sure how I can implement this.
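One pattern I have seen for this is to override the spider's __init__ and read the values from spider arguments, since scrapyd forwards extra schedule() parameters (url=..., domain=...) to the spider as keyword arguments. A minimal sketch of that idea, with the Scrapy base class left out so the example is self-contained; in the real file the class would inherit from CrawlSpider:

```python
from urllib.parse import urlparse

class IcrawlerSpider:  # in the real project: class IcrawlerSpider(CrawlSpider)
    name = 'icrawler'

    def __init__(self, *args, url=None, domain=None, **kwargs):
        # scrapyd passes the extra schedule() parameters here as kwargs
        if url is None:
            raise ValueError('url is required, e.g. scrapyd.schedule(..., url=...)')
        self.start_urls = [url]
        # allowed_domains expects bare hostnames, not full URLs with a scheme
        self.allowed_domains = [domain or urlparse(url).netloc]
        super().__init__(*args, **kwargs)
```

With this, the class-level allowed_domains and start_urls can be removed, and each scheduled job crawls whatever URL Django passed in.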