Callback function not called in Scrapy after a redirect

Time: 2014-03-05 10:53:34

Tags: python scrapy

I have created a minimal spider, shown below:

from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from sandbox.items import SandboxItem

class SandboxCrawlSpider(CrawlSpider):
    name = 'sandbox_crawl'
    allowed_domains = ['amazonaws.com']
    start_urls = ['http://www.amazonaws.com/']
    rules = (
        Rule(SgmlLinkExtractor(), callback=('parse_item'), follow=True),
    )

    def parse_item(self, response):
        sel = Selector(response)
        i = SandboxItem()
        print response.url

        return i

The problem is that my allowed domain, amazonaws.com, redirects to aws.amazon.com.

After the redirect, the crawler fetches the whole page but never calls the callback function. The output looks like this:

2014-03-05 15:50:56+0530 [sandbox_crawl] DEBUG: Redirecting (301) to <GET http://aws.amazon.com> from <GET http://www.amazonaws.com/>
2014-03-05 15:50:58+0530 [sandbox_crawl] DEBUG: Crawled (200) <GET http://aws.amazon.com> (referer: None)
2014-03-05 15:50:58+0530 [sandbox_crawl] DEBUG: Filtered offsite request to 'aws.amazon.com': <GET http://aws.amazon.com/>
2014-03-05 15:50:58+0530 [sandbox_crawl] DEBUG: Filtered offsite request to 'portal.aws.amazon.com': <GET https://portal.aws.amazon.com/gp/aws/developer/registration/index.html>
2014-03-05 15:50:58+0530 [sandbox_crawl] DEBUG: Filtered offsite request to 'console.aws.amazon.com': <GET https://console.aws.amazon.com/>
2014-03-05 15:50:58+0530 [sandbox_crawl] DEBUG: Filtered offsite request to 'www.youtube.com': <GET http://www.youtube.com/embed/mZ5H8sn_2ZI?autoplay=1&hd=1&rel=0>
2014-03-05 15:50:58+0530 [sandbox_crawl] DEBUG: Filtered offsite request to 'youtube.com': <GET http://youtube.com/embed/jOhbTAU4OPI?autoplay=1&hd=1&rel=0>
2014-03-05 15:50:58+0530 [sandbox_crawl] DEBUG: Filtered offsite request to 'www.powerof60.com': <GET http://www.powerof60.com/?00N500000026nJd=BA_AWSHP_IntelAWS_Generic&sc_icampaign=ha_en_intel_power_of_60_ed&sc_icampaigntype=partners&sc_ichannel=ha&sc_icountry=us&sc_ipage=homepage&sc_iplace=editorial_r3_right_banner&utm_campaign=BA_AWSHP_IntelAWS&utm_content=GenericAWS&utm_medium=banner&utm_source=AWSHP>
2014-03-05 15:50:58+0530 [sandbox_crawl] DEBUG: Filtered offsite request to 'aws.typepad.com': <GET http://aws.typepad.com/>
2014-03-05 15:50:58+0530 [sandbox_crawl] DEBUG: Filtered offsite request to 'phx.corporate-ir.net': <GET http://phx.corporate-ir.net/phoenix.zhtml?c=176060&p=irol-InfoReq>
2014-03-05 15:50:58+0530 [sandbox_crawl] INFO: Closing spider (finished)

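As you can see, parse_item is never called. 'print response.url' has no effect, and neither does any other statement in the function. Is there something wrong with the spider?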

2 answers:

Answer 0: (score: 0)

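Just add "aws.amazon.com" to allowed_domains. The log shows every link extracted from the redirected page being filtered as offsite, because aws.amazon.com is not a subdomain of amazonaws.com, so no request ever reaches parse_item: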
allowed_domains = ['amazonaws.com', 'aws.amazon.com']
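
To illustrate why the extra entry matters, here is a minimal sketch of the kind of host check the offsite filter performs. It is a simplified stand-in written with the standard library only (Python 3), not Scrapy's actual code; the helper name is_onsite is hypothetical, and the sketch assumes a host counts as on-site when it equals an allowed domain or is a subdomain of one.

```python
from urllib.parse import urlparse


def is_onsite(url, allowed_domains):
    """Hypothetical helper: True if the URL's host equals an allowed
    domain or is a subdomain of one (simplified model of offsite filtering)."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# URLs taken from the log in the question.
urls = [
    "http://www.amazonaws.com/",
    "http://aws.amazon.com/",
    "https://portal.aws.amazon.com/gp/aws/developer/registration/index.html",
    "http://www.youtube.com/embed/mZ5H8sn_2ZI",
]

before = ["amazonaws.com"]                   # original setting
after = ["amazonaws.com", "aws.amazon.com"]  # setting from this answer

for url in urls:
    # Print whether each URL passes the check under both settings.
    print(url, is_onsite(url, before), is_onsite(url, after))
```

Under this model, aws.amazon.com and portal.aws.amazon.com only pass the check once "aws.amazon.com" is listed, while unrelated hosts such as www.youtube.com stay filtered either way.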

Answer 1: (score: 0)

Add dont_filter=True to the request so that it bypasses the offsite filter, although this does not ultimately solve the problem.

Like this:

Request('http://example.org/', callback = self.func, dont_filter=True)