Scrapy Max Redirect problem

Asked: 2012-04-02 15:11:31

Tags: python screen-scraping scrapy

I am just trying to crawl a single page:

start_urls = ['https://www.mileageplusshopping.com/shopping/b____alpha.htm']

but it gets redirected again and again, and in the end Scrapy discards the request.

I even tried setting

REDIRECT_MAX_TIMES=100

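With this setting it still redirects, now 100 times, and then Scrapy discards the request.

Any help would be greatly appreciated.

Here is the log.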

2012-04-02 20:10:53+0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false> from <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false> from <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false> from <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false> from <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false>
2012-04-02 20:10:53+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:54+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false> from <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:54+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false>
2012-04-02 20:10:54+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:54+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false> from <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:54+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false>
2012-04-02 20:10:54+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:54+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false> from <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm>
2012-04-02 20:10:54+0500 [mileageplusshopping] DEBUG: Redirecting (302) to <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false>
2012-04-02 20:10:54+0500 [mileageplusshopping] DEBUG: Discarding <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm>: max redirections reached
2012-04-02 20:10:54+0500 [mileageplusshopping] ERROR: Error downloading <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm>: 
2012-04-02 20:10:54+0500 [mileageplusshopping] INFO: Closing spider (finished)

I am using Scrapy 0.14.

Here is my settings file:

BOT_NAME = 'mall_crawler'
BOT_VERSION = '1.0'

SPIDER_MODULES = ['mall_crawler.spiders']
NEWSPIDER_MODULE = 'mall_crawler.spiders'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)

USER_AGENT = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8'

RANDOMIZE_DOWNLOAD_DELAY = True
DOWNLOAD_DELAY = 1   

HTTPCACHE_ENABLED = True    
HTTPCACHE_EXPIRATION_SECS = 0

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': None,
}

SCHEDULER_ORDER = 'BFO'

2 Answers:

Answer 0: (score: 2)

I found the solution, so I want to share it with everyone.

It was simply because of

HTTPCACHE_ENABLED = True  

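The start URL is actually https://www.mileageplusshopping.com/shopping/b____alpha.htm

which redirects to https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false

which redirects to https://x.www.mileageplusshopping.com/shopping/b____alpha.htm

and finally redirects back to https://www.mileageplusshopping.com/shopping/b____alpha.htm

If you compare them, the first request and the last request are identical.

That is why, on the last request, Scrapy found this request in the cache (where the stored response was the original 302) and the loop began. If we do not cache pages, everything works fine.

Or, if we do want to cache pages, we have to handle all of this manually.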
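The loop described above can be reproduced with a small toy model. This is only a sketch, not Scrapy's actual middleware code: `FakeSite`, `crawl`, the shortened `SSO` URL, and the idea that the site considers the session ready once the `x.www` hop has been visited are all assumptions made for illustration. It shows that replaying a cached 302 for the start URL loops no matter how high the redirect limit is, while fetching fresh every time ends after three redirects, as in the logs.

```python
# Toy model of the redirect chain from the logs; NOT Scrapy's real code.
START = "https://www.mileageplusshopping.com/shopping/b____alpha.htm"
SSO = "https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx"  # query string omitted
BRIDGE = "https://x.www.mileageplusshopping.com/shopping/b____alpha.htm"


class FakeSite(object):
    """Hypothetical server: START keeps redirecting until BRIDGE was visited."""

    def __init__(self):
        self.session_ready = False  # assumed server-side/cookie state

    def fetch(self, url):
        # Returns (status, redirect_target_or_None).
        if url == START:
            return (200, None) if self.session_ready else (302, SSO)
        if url == SSO:
            return (302, BRIDGE)
        if url == BRIDGE:
            self.session_ready = True
            return (302, START)
        return (404, None)  # nothing else exists in this toy model


def crawl(use_cache, max_redirects=20):
    """Follow redirects from START; return (final_status, redirects_followed)."""
    site, cache, url = FakeSite(), {}, START
    for hops in range(max_redirects + 1):
        if use_cache and url in cache:
            status, target = cache[url]  # replay the stored response
        else:
            status, target = site.fetch(url)
            if use_cache:
                cache[url] = (status, target)  # 302s are stored too
        if status != 302:
            return status, hops
        url = target
    return None, max_redirects  # gave up: max redirections reached


print(crawl(use_cache=False))                     # (200, 3)
print(crawl(use_cache=True))                      # (None, 20)
print(crawl(use_cache=True, max_redirects=100))   # (None, 100)
```

In terms of the settings from the question, the non-looping case corresponds to leaving out `HTTPCACHE_ENABLED = True` (or setting it to `False`); raising `REDIRECT_MAX_TIMES` only makes the loop run longer.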

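Answer 1: (score: 0)

I don't think this is a REDIRECT_MAX_TIMES problem. I think it is simply a redirect problem.

You have to find out why the page redirects.

Why? Possibilities:

  1. The site looks at your USER_AGENT (I think this is the most likely).
  2. The site looks at your cookies.
  3. The site does something with JavaScript, which is obviously "disabled" in Scrapy.
  4. Or a combination of these.

    Update:

    I made a test spider for that site, and it looks like it is not a simple site. The Firefox log shows this: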

    [10:21:45.707] GET https://www.mileageplusshopping.com/shopping/b____alpha.htm [HTTP/1.1 302 Found 2128ms]
    [10:21:47.856] GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false [HTTP/1.1 302 Moved Temporarily 517ms]
    [10:21:48.375] GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm [HTTP/1.1 302 Found 1664ms]
    [10:21:50.042] GET https://www.mileageplusshopping.com/shopping/b____alpha.htm [HTTP/1.1 200 OK 3818ms]
    [10:21:53.230] GET https://a248.e.akamai.net/f/248/35975/5d/i.mallnetworks.com/images/css/united/mn_brand_united_noncardholder.css [HTTP/1.0 200 OK 446ms]
    

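    My conclusion for now is that the browser is redirected as well, and there the redirect chain completes. This needs further investigation (I'm no expert on that).

    Another update:

    Actually, the spider works fine here: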

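    # Scrapy 0.14 (Python 2) import path for the base spider class
    from scrapy.spider import BaseSpider

    class TestSpider(BaseSpider):
        name = "mileageplusshopping_com"
        allowed_domains = ["mileageplusshopping.com"]
        start_urls = [
            'https://www.mileageplusshopping.com/shopping/b____alpha.htm'
        ]
    
        def parse(self, response):
            # Only reached once the redirect chain ends in a 200 response
            print 'here'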
    

    Run:

    vic@wic:~/projects/test$ scrapy crawl mileageplusshopping_com
    2012-04-03 10:30:40+0300 [scrapy] INFO: Scrapy 0.14.2 started (bot: test)
    2012-04-03 10:30:40+0300 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
    2012-04-03 10:30:40+0300 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2012-04-03 10:30:40+0300 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2012-04-03 10:30:40+0300 [scrapy] DEBUG: Enabled item pipelines: 
    2012-04-03 10:30:40+0300 [mileageplusshopping_com] INFO: Spider opened
    2012-04-03 10:30:40+0300 [mileageplusshopping_com] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2012-04-03 10:30:40+0300 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
    2012-04-03 10:30:40+0300 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
    2012-04-03 10:30:42+0300 [mileageplusshopping_com] DEBUG: Redirecting (302) to <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false> from <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm>
    2012-04-03 10:30:43+0300 [mileageplusshopping_com] DEBUG: Redirecting (302) to <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://www.united.com/web/en-US/apps/sso/LoginBridge.aspx?target=/shopping/b____alpha.htm&redirect=sec&targetURLKey=cartua.bridge.url&remove=false>
    2012-04-03 10:30:44+0300 [mileageplusshopping_com] DEBUG: Redirecting (302) to <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm> from <GET https://x.www.mileageplusshopping.com/shopping/b____alpha.htm>
    2012-04-03 10:30:47+0300 [mileageplusshopping_com] DEBUG: Crawled (200) <GET https://www.mileageplusshopping.com/shopping/b____alpha.htm> (referer: None)
    here
    2012-04-03 10:30:47+0300 [mileageplusshopping_com] INFO: Closing spider (finished)
    2012-04-03 10:30:47+0300 [mileageplusshopping_com] INFO: Dumping spider stats:
            {'downloader/request_bytes': 1140,
             'downloader/request_count': 4,
             'downloader/request_method_count/GET': 4,
             'downloader/response_bytes': 68882,
             'downloader/response_count': 4,
             'downloader/response_status_count/200': 1,
             'downloader/response_status_count/302': 3,
             'finish_reason': 'finished',
             'finish_time': datetime.datetime(2012, 4, 3, 7, 30, 47, 879869),
             'scheduler/memory_enqueued': 4,
             'start_time': datetime.datetime(2012, 4, 3, 7, 30, 40, 250275)}
    2012-04-03 10:30:47+0300 [mileageplusshopping_com] INFO: Spider closed (finished)
    2012-04-03 10:30:47+0300 [scrapy] INFO: Dumping global stats:
            {'memusage/max': 88838144, 'memusage/startup': 88838144}
    vic@wic:~/projects/test$