Question

I'm using Scrapy on an ASP.NET site where all the links look alike:

javascript:__doPostBack('gridID','Select$0')
javascript:__doPostBack('gridID','Select$1')
....

I can follow the link to any record's details page using FormRequest:

    # First grab all of the Details links; everything we want can be read from them
    for sel in response.xpath("//table[@id='gridID']/tr[td]")[0:20]:
        # href looks like: javascript:__doPostBack('gridID','Select$0')
        href = sel.xpath("td")[0].xpath("a/@href").extract()[0]
        thisTarget = href.split("'")[1]
        thisArg = href.split("'")[3]
        yield scrapy.FormRequest.from_response(
            response,
            formdata={
                '__EVENTTARGET': thisTarget,
                '__EVENTARGUMENT': thisArg,
                '__EVENTVALIDATION': response.xpath("//input[@id='__EVENTVALIDATION']/@value").extract()[0],
                '__VIEWSTATE': response.xpath("//input[@id='__VIEWSTATE']/@value").extract()[0],
            },
            dont_click=True,
            callback=self.parseDetail,
            dont_filter=True,
        )
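The two `split("'")` calls depend on the exact quoting of the `href`. As a minimal sketch, a regular expression can pull out both arguments in one step and fail loudly on an unexpected link. `parse_postback` is a hypothetical helper name, not part of Scrapy.

```python
import re

# Matches hrefs of the form: javascript:__doPostBack('gridID','Select$0')
_POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")


def parse_postback(href):
    """Return (event_target, event_argument) from a __doPostBack href.

    Hypothetical helper for illustration; raises ValueError when the href
    does not look like a __doPostBack link.
    """
    match = _POSTBACK_RE.search(href)
    if match is None:
        raise ValueError("not a __doPostBack href: %r" % href)
    return match.group(1), match.group(2)


target, arg = parse_postback("javascript:__doPostBack('gridID','Select$0')")
print(target, arg)
```

The returned pair can be used directly as the `__EVENTTARGET` and `__EVENTARGUMENT` values.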

But when Scrapy processes multiple items at once, it issues the requests in batches. Five rows at a time results in:

2015-02-20 22:41:19-0500 [spider] DEBUG: Redirecting (302) to <GET http://domain.com/ListingDetail.aspx> from <POST http://domain.com/Listing.aspx>
2015-02-20 22:41:20-0500 [spider] DEBUG: Redirecting (302) to <GET http://domain.com/ListingDetail.aspx> from <POST http://domain.com/Listing.aspx>
2015-02-20 22:41:20-0500 [spider] DEBUG: Redirecting (302) to <GET http://domain.com/ListingDetail.aspx> from <POST http://domain.com/Listing.aspx>
2015-02-20 22:41:21-0500 [spider] DEBUG: Redirecting (302) to <GET http://domain.com/ListingDetail.aspx> from <POST http://domain.com/Listing.aspx>
2015-02-20 22:41:22-0500 [spider] DEBUG: Redirecting (302) to <GET http://domain.com/ListingDetail.aspx> from <POST http://domain.com/Listing.aspx>
2015-02-20 22:41:22-0500 [spider] DEBUG: Crawled (200) <GET http://domain.com/ListingDetail.aspx> (referer: http://domain.com/Listing.aspx)
### Callback executed
2015-02-20 22:41:23-0500 [spider] DEBUG: Crawled (200) <GET http://domain.com/ListingDetail.aspx> (referer: http://domain.com/Listing.aspx)
### Callback executed
2015-02-20 22:41:23-0500 [spider] DEBUG: Crawled (200) <GET http://domain.com/ListingDetail.aspx> (referer: http://domain.com/Listing.aspx)
### Callback executed
2015-02-20 22:41:24-0500 [spider] DEBUG: Crawled (200) <GET http://domain.com/ListingDetail.aspx> (referer: http://domain.com/Listing.aspx)
### Callback executed
2015-02-20 22:41:24-0500 [spider] DEBUG: Crawled (200) <GET http://domain.com/ListingDetail.aspx> (referer: http://domain.com/Listing.aspx)
### Callback executed

This seems to make all 5 responses identical, which I assume is the result of some ASP magic.
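One way to picture that "ASP magic" is a toy model. It is an assumption, not something confirmed for this site. Suppose the POST only records the selected row in per-session state and the redirect target `ListingDetail.aspx` reads it back. Then if all POSTs land before any redirect is followed, every GET sees the last selection. `ToyServer` and both crawl functions are hypothetical names.

```python
class ToyServer:
    """Toy stand-in for the site; NOT real ASP.NET behaviour, only the assumed mechanism."""

    def __init__(self):
        self.selected = {}  # session id -> row chosen by the most recent POST

    def post_listing(self, session_id, event_argument):
        # The POST only records the selection, then answers with a redirect.
        self.selected[session_id] = event_argument
        return "302 -> /ListingDetail.aspx"

    def get_detail(self, session_id):
        # The redirect target carries no row id; it reads the session instead.
        return "detail for " + self.selected[session_id]


def crawl_batched(server, session_id, args):
    # All POSTs first, then all redirects (the order seen in the log).
    for arg in args:
        server.post_listing(session_id, arg)
    return [server.get_detail(session_id) for _ in args]


def crawl_one_at_a_time(server, session_id, args):
    # Each POST is followed by its redirect before the next POST is sent.
    results = []
    for arg in args:
        server.post_listing(session_id, arg)
        results.append(server.get_detail(session_id))
    return results


rows = ["Select$%d" % i for i in range(5)]
batched = crawl_batched(ToyServer(), "s1", rows)
sequential = crawl_one_at_a_time(ToyServer(), "s1", rows)
print(len(set(batched)))     # 1 distinct page
print(len(set(sequential)))  # 5 distinct pages
```

In this model, the batched order yields five copies of the last row's page, which matches the symptom described.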

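I tried setting REDIRECT_PRIORITY_ADJUST = 100 to give redirects a higher priority, but with limited success. The most it achieves is that Scrapy stops after 16 initial requests, performs the 16 redirects, then issues another batch of initial requests, and so on....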

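When I do this manually in the scrapy shell, fetching each FormRequest in turn, the redirect is handled immediately and I get the expected response, even when fetching several requests in a row.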

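So, my question is:

Is there a way to make Scrapy process a request all the way through to the HTTP 200 response, following any redirects immediately?

Or... is there any other solution to my problem that may not be obvious?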

Answer 1

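I ran into the same problem using FormRequest with a site that sends back a 302 redirect. For many requests, the response was the same. It seems to happen before the downloader middleware, perhaps even between Scrapy's requests and Twisted: I added a custom downloader middleware to inspect the requests and responses, and it showed the same problem.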

I found a workaround using the following Scrapy settings.

CONCURRENT_REQUESTS=1
CONCURRENT_REQUESTS_PER_DOMAIN=1
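If slowing down the whole project is undesirable, the same two settings can presumably be limited to a single spider through Scrapy's `custom_settings` class attribute. A sketch, with the values kept in a plain dict so it runs without Scrapy installed; `SERIAL_SETTINGS` and `ListingSpider` are hypothetical names.

```python
# Hypothetical name; the two keys are the settings from the workaround above.
SERIAL_SETTINGS = {
    "CONCURRENT_REQUESTS": 1,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
}

# In the spider (not executed here, requires Scrapy):
#
#     class ListingSpider(scrapy.Spider):
#         name = "listing"
#         custom_settings = SERIAL_SETTINGS

print(SERIAL_SETTINGS)
```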

There should be a better way.
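One candidate, which is not from this answer and is untested against this site, assumes the server keeps the selected row in cookie-based session state. In that case giving each request its own cookie jar, through the `cookiejar` meta key handled by Scrapy's cookies middleware, might keep concurrent requests from overwriting each other. The sketch below only builds the keyword arguments, so it runs without Scrapy; `postback_request_kwargs` is a hypothetical helper.

```python
def postback_request_kwargs(event_target, event_argument, jar_id):
    """Build keyword arguments for scrapy.FormRequest.from_response.

    Hypothetical helper. `jar_id` selects a separate cookie jar through the
    `cookiejar` meta key, so each request would get its own server session
    (assuming the site tracks sessions with cookies).
    """
    return {
        "formdata": {
            "__EVENTTARGET": event_target,
            "__EVENTARGUMENT": event_argument,
        },
        "meta": {"cookiejar": jar_id},
        "dont_click": True,
        "dont_filter": True,
    }


kwargs = postback_request_kwargs("gridID", "Select$0", jar_id=0)
print(kwargs["meta"])

# In the spider callback (not executed here, requires Scrapy):
#
#     yield scrapy.FormRequest.from_response(
#         response, callback=self.parseDetail, **kwargs)
```

A fresh jar means the POST no longer carries the cookies set by the listing page, so the site may reject it or behave differently; in that case the serial settings remain the safer option.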
