My spider has been running fine so far. Everything works except this:
# -*- coding: utf-8 -*-
import scrapy
from info.items import InfoItem


class HeiseSpider(scrapy.Spider):
    name = "heise"
    start_urls = ['https://www.heise.de/']

    def parse(self, response):
        print("Parse")
        yield scrapy.Request(response.url, callback=self.getSubList)

    def getSubList(self, response):
        item = InfoItem()
        print("Sub List: Will it work?")
        yield scrapy.Request('https://www.test.de/', callback=self.getScore, dont_filter=True)
        print("Should have")
        yield item

    def getScore(self, response):
        print("--------- Get Score ----------")
        print(response)
        return True
The output is:
Will it work?
Should have
Why is getScore never called? What am I doing wrong?
Edit: I changed the code to a bare-bones version that shows the same problem: getScore is not called.
Answer (score: 4):
I just tested this, and it runs through all the callbacks as expected:
...
2017-05-13 12:27:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.heise.de/> (referer: None)
Parse
2017-05-13 12:27:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.heise.de/> (referer: https://www.heise.de/)
Sub List: Will it work?
Should have
2017-05-13 12:27:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.test.de/> (referer: https://www.heise.de/)
--------- Get Score ----------
<200 https://www.test.de/>
2017-05-13 12:27:59 [scrapy.core.scraper] ERROR: Spider must return Request, BaseItem, dict or None, got 'bool' in <GET https://www.test.de/>
2017-05-13 12:27:59 [scrapy.core.engine] INFO: Closing spider (finished)
2017-05-13 12:27:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 693,
...
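As a side note, the ERROR near the end of that log comes from getScore returning True: per the error message, a spider callback must return a Request, BaseItem, dict, or None. A minimal sketch of the corrected method, kept as a print-only stub like the original:

    def getScore(self, response):
        print("--------- Get Score ----------")
        print(response)
        # Return nothing (implicitly None), or yield items/requests here instead of a bool.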
Without the full logging output, and with settings.py not shown, this is a bit of a guess, but most likely your settings.py has ROBOTSTXT_OBEY=True.
That means Scrapy respects whatever restrictions the site's robots.txt file imposes, and https://www.test.de has a robots.txt that disallows crawling.
So change the ROBOTSTXT line in your settings.py to ROBOTSTXT_OBEY=False and it should work fine.
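For reference, a minimal sketch of the relevant line, assuming a standard Scrapy project's settings.py:

# settings.py
# Don't let robots.txt rules block requests (use responsibly).
ROBOTSTXT_OBEY = False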