I'm new to django-dynamic-scraper and am using the open_news example to learn it. I have set everything up, but I keep getting the same error: dynamic_scraper.models.DoesNotExist: RequestPageType matching query does not exist.
2015-11-20 18:45:11+0000 [article_spider] ERROR: Spider error processing <GET https://en.wikinews.org/wiki/Main_page>
Traceback (most recent call last):
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Twisted-15.4.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 825, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Twisted-15.4.0-py2.7-linux-x86_64.egg/twisted/internet/task.py", line 645, in _tick
taskObj._oneWorkUnit()
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Twisted-15.4.0-py2.7-linux-x86_64.egg/twisted/internet/task.py", line 491, in _oneWorkUnit
result = next(self._iterator)
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
yield next(it)
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 26, in process_spider_output
for x in result:
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/dynamic_scraper/spiders/django_spider.py", line 378, in parse
rpt = self.scraper.get_rpt_for_scraped_obj_attr(url_elem.scraped_obj_attr)
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/dynamic_scraper/models.py", line 98, in get_rpt_for_scraped_obj_attr
return self.requestpagetype_set.get(scraped_obj_attr=soa)
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Django-1.8.5-py2.7.egg/django/db/models/manager.py", line 127, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/home/suz/social-network-sujit/local/lib/python2.7/site-packages/Django-1.8.5-py2.7.egg/django/db/models/query.py", line 334, in get
self.model._meta.object_name
dynamic_scraper.models.DoesNotExist: RequestPageType matching query does not exist.
Answer 0 (score: 2)
This happens because the "REQUEST PAGE TYPES" are missing. Each of the "SCRAPER ELEMS" must have its own "Request page type".
To fix this, set up the request page types as follows:
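"REQUEST PAGE TYPES" settings
All "Content type" values are "HTML"
All "Request type" values are "Request"
All "Method" values are "Get"
For "Page type", just assign them in sequence, like this:
(base (Article)) | Main Page
(title (Article)) | Detail Page 1
(description (Article)) | Detail Page 2
(url (Article)) | Detail Page 3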
After completing the steps above, the "DoesNotExist: RequestPageType" error should be fixed.
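To see why every attribute needs its own entry, here is a minimal sketch of the lookup that fails in the traceback (`requestpagetype_set.get(scraped_obj_attr=soa)`). It is a plain-Python stand-in, not the library's code: `ToyScraper`, its dictionary, and the local `DoesNotExist` class are hypothetical names made up for illustration.

```python
class DoesNotExist(Exception):
    """Stand-in for the Django DoesNotExist exception (illustration only)."""


class ToyScraper:
    """Hypothetical model of a scraper with per-attribute request page types."""

    def __init__(self):
        # Maps a scraped object attribute name to its page type.
        self.request_page_types = {}

    def add_request_page_type(self, scraped_obj_attr, page_type):
        self.request_page_types[scraped_obj_attr] = page_type

    def get_rpt_for_scraped_obj_attr(self, scraped_obj_attr):
        # Mirrors the shape of the failing call: exactly one match is expected.
        try:
            return self.request_page_types[scraped_obj_attr]
        except KeyError:
            raise DoesNotExist("RequestPageType matching query does not exist.")


scraper = ToyScraper()
scraper.add_request_page_type("base", "Main Page")

# "url" has no request page type yet, so the lookup raises.
try:
    scraper.get_rpt_for_scraped_obj_attr("url")
    lookup_failed = False
except DoesNotExist:
    lookup_failed = True

# Once an entry exists for "url", the same lookup succeeds.
scraper.add_request_page_type("url", "Detail Page 3")
url_page_type = scraper.get_rpt_for_scraped_obj_attr("url")
```

The point is only that the lookup is keyed by attribute, so an attribute without an entry fails no matter how the other entries are configured.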
However, "ERROR: Mandatory elem title missing!" will then come up!
To solve this, I suggest you change every "REQUEST PAGE TYPE" in "SCRAPER ELEMS" to "Main Page", including the one for "title (Article)".
Then change the XPaths as follows:
(base (Article)) | //td[@class="l_box"]
(title (Article)) | span[@class="l_title"]/a/@title
(description (Article)) | p/span[@class="l_summary"]/text()
(url (Article)) | span[@class="l_title"]/a/@href
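If it helps to see how the base XPath and the relative XPaths fit together, here is a rough sketch using only the Python standard library. The markup is a hypothetical, simplified fragment written to match the class names above, not the real Wikinews page. `xml.etree.ElementTree` supports only a subset of XPath, so the trailing `/@title`, `/@href` and `/text()` steps are emulated with `.get()` and `.text`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is larger and may differ.
SAMPLE = """
<table>
  <tr>
    <td class="l_box">
      <span class="l_title"><a href="/wiki/Example_story" title="Example story">Example story</a></span>
      <p><span class="l_summary">A one-line summary.</span></p>
    </td>
  </tr>
</table>
"""


def extract_articles(markup):
    root = ET.fromstring(markup)
    articles = []
    # base: //td[@class="l_box"] selects one node per article.
    for base in root.findall(".//td[@class='l_box']"):
        # The remaining XPaths are evaluated relative to the base node.
        link = base.find("span[@class='l_title']/a")
        summary = base.find("p/span[@class='l_summary']")
        articles.append({
            "title": link.get("title"),    # span[@class="l_title"]/a/@title
            "description": summary.text,   # p/span[@class="l_summary"]/text()
            "url": link.get("href"),       # span[@class="l_title"]/a/@href
        })
    return articles


articles = extract_articles(SAMPLE)
```

Because title, description and url are all resolved relative to the base element, they can all be read from the main page, which is why every elem can use "Main Page" here.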
Finally, run scrapy crawl article_spider -a id=1 -a do_action=yes at the command prompt.
You should now be able to crawl the "Article" objects.
You can check them under Home › Open_News › Articles
Enjoy~
Answer 1 (score: 1)
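I may be late to the party, but I hope my solution helps anyone who runs into this later.
@alan-nala's solution works well. However, it essentially skips crawling the detail pages.
Here is how you can make full use of detail page scraping.
First, go to Home › Dynamic_Scraper › Scrapers › Wikinews Scraper (Article) and add those under REQUEST PAGE TYPES.
Second, make sure your elements in SCRAPER ELEMS look like this.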
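Now you can run the manual scraping command as described in the documentation:
scrapy crawl article_spider -a id=1 -a do_action=yes
You will most likely run into the error @alan-nala mentioned:
"ERROR: Mandatory elem title missing!"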
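Look at the error screenshot: in my case, I got a message indicating that the script was "Calling DP2 URL for...".
So, finally, go back to SCRAPER ELEMS and change the request page type of the element "title (Article)" to "Detail Page 2" instead of "Detail Page 1".
Save the settings and run the scrapy command again.
Note: your "Detail Page" number may be different.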
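Here is a toy model of why the page type has to match the detail page that actually gets called. This is my own simplified reading, not dynamic_scraper's implementation: `scrape_item`, the page type codes other than DP2, and the elem assignments are all assumptions for illustration. An elem only receives a value if its page type is one of the pages that were fetched, and a mandatory elem without a value produces the error.

```python
def scrape_item(elems, fetched_page_types):
    """Toy model: elems maps name -> (page_type, mandatory)."""
    item, errors = {}, []
    for name, (page_type, mandatory) in elems.items():
        if page_type in fetched_page_types:
            # Placeholder value; a real spider would apply the XPath here.
            item[name] = "<scraped %s>" % name
        elif mandatory:
            errors.append("Mandatory elem %s missing!" % name)
    return item, errors


# Assumption: the log shows only the main page and DP2 being requested.
fetched = {"MP", "DP2"}

# Hypothetical assignments, with the title still on "Detail Page 1".
elems_before = {
    "base": ("MP", True),
    "title": ("DP1", True),
    "description": ("DP2", False),
    "url": ("MP", True),
}
item_before, errors_before = scrape_item(elems_before, fetched)

# Moving the title to the page type that is actually called clears the error.
elems_after = dict(elems_before, title=("DP2", True))
item_after, errors_after = scrape_item(elems_after, fetched)
```

If your log mentions a different detail page, use that page type for the title instead.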
By the way, I have also prepared a short tutorial hosted on GitHub, in case you need more detail on this topic.