Scrapy request not passed to the callback on a 301?

Time: 2015-08-02 20:01:54

Tags: python web-crawler scrapy

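I'm trying to update a database full of links to external websites. For some reason, the callback is skipped whenever the requested page has been moved (the response is a 301 redirect).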

def start_requests(self): 

    #... database stuff

    for x in xrange(0, numrows):
        row = cur.fetchone()

        item = exampleItem()

        item['real_id'] = row[0]
        item['product_id'] = row[1]
        url = "http://www.example.com/a/-" + item['real_id'] + ".htm"
        log.msg("item %d request URL is %s" % (item['product_id'], url), log.INFO)  # logs the expected URL
        request = scrapy.Request(url, callback=self.parse_url)
        request.meta['item'] = item
        yield request

def parse_url(self, response):
    item = response.meta['item']
    item['real_url'] = response.url
    log.msg("item %d new URL is %s" % (item['product_id'], item['real_url']), log.INFO)  # never logged for items whose request was redirected

The Scrapy version is 0.24. What should I do?

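Fun fact: it only happens on some of the broken links, even though they come from the same website, have the exact same URL pattern, and so on.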

1 answer:

Answer 0 (score: 1)

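I had to pass the dont_filter=True argument to the Request (not to the response callback), i.e. scrapy.Request(url, callback=self.parse_url, dont_filter=True). Without it, the request that follows the redirect can be dropped by Scrapy's duplicate filter when its target URL has already been seen, so the callback never runs for that item.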
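A likely reason why only some of the moved links lose their callback is that several of them redirect to the same target URL, and the duplicate filter lets only the first one through. The sketch below is a toy model of that behaviour in plain Python. It is not Scrapy's actual code. `REDIRECTS`, `crawl` and the `moved.htm` URL are hypothetical names made up for illustration.

```python
# Toy model of a crawler's duplicate filter. This is NOT Scrapy's implementation;
# REDIRECTS, crawl() and the URLs below are hypothetical and only for illustration.

# Assumed 301 map: two moved pages redirect to the same target URL.
REDIRECTS = {
    "http://www.example.com/a/-1.htm": "http://www.example.com/moved.htm",
    "http://www.example.com/a/-2.htm": "http://www.example.com/moved.htm",
}


def crawl(start_urls, dont_filter=False):
    """Return (start_url, final_url) pairs for which the callback would run."""
    seen = set()      # URLs the duplicate filter has already let through
    delivered = []    # stands in for calls to the spider callback
    queue = [(url, url) for url in start_urls]
    while queue:
        origin, url = queue.pop(0)
        if not dont_filter:
            if url in seen:
                continue  # dropped silently: the callback is never called
            seen.add(url)
        if url in REDIRECTS:
            # Following a 301 schedules a new request for the target URL.
            queue.append((origin, REDIRECTS[url]))
        else:
            delivered.append((origin, url))
    return delivered


start_urls = [
    "http://www.example.com/a/-1.htm",
    "http://www.example.com/a/-2.htm",
    "http://www.example.com/a/-3.htm",
]
filtered = crawl(start_urls)                      # default: duplicates are dropped
unfiltered = crawl(start_urls, dont_filter=True)  # every item reaches the callback
```

In this model `filtered` contains only two of the three items, because the second redirect to `moved.htm` is treated as a duplicate, while `unfiltered` contains all three. The trade-off of disabling the filter is that genuinely repeated URLs are fetched again.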