How to handle 302 redirects in Scrapy

Asked: 2014-04-01 19:42:22

Tags: python scrapy http-status-code-302

I am getting a 302 response from the server while crawling a website:

2014-04-01 21:31:51+0200 [ahrefs-h] DEBUG: Redirecting (302) to <GET http://www.domain.com/Site_Abuse/DeadEnd.htm> from <GET http://domain.com/wps/showmodel.asp?Type=15&make=damc&a=664&b=51&c=0>

I want to send a request to the GET URL instead of being redirected. Now I found this middleware:

https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/downloadermiddleware/redirect.py#L31

I added this redirect code to my middleware.py file, and added it to settings.py:

DOWNLOADER_MIDDLEWARES = {
    'street.middlewares.RandomUserAgentMiddleware': 400,
    'street.middlewares.RedirectMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}

But I am still getting redirected. Is that all I have to do in order to get this middleware working? Am I missing something?

6 Answers:

Answer 0 (score: 11):

Forget about middlewares in this scenario; this will do the trick:

meta = {'dont_redirect': True, 'handle_httpstatus_list': [302]}

That said, you will need to include the meta parameter when you yield the request:

yield Request(item['link'], meta={
                  'dont_redirect': True,
                  'handle_httpstatus_list': [302]
              }, callback=self.your_callback)
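With those meta keys set, the 302 response is delivered to your callback instead of being followed automatically. A minimal, Scrapy-free sketch of the decision you can then make yourself (`should_follow` and `blocked_paths` are my own names, not from the answer; in a real Scrapy callback, `response.headers` values are bytes rather than strings):

```python
def should_follow(status, headers, blocked_paths=('/Site_Abuse/',)):
    """Return the redirect target if it looks legitimate, else None."""
    if status != 302:
        return None
    target = headers.get('Location')
    if not target:
        return None
    # Refuse to follow redirects into known dead-end pages,
    # e.g. the Site_Abuse/DeadEnd.htm page from the question
    if any(path in target for path in blocked_paths):
        return None
    return target
```

In the callback you would call this with `response.status` and `response.headers`, and only yield a follow-up request when it returns a target.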

Answer 1 (score: 2):

An unexplained 302 response, such as a redirect to the home page or some fixed page from a page that loads fine in a web browser, usually indicates a server-side measure against undesired activity.

You must either slow down your crawl or use a smart proxy (e.g. Crawlera) or a proxy-rotation service, and retry your requests when you get such a response.

To retry such a response, add 'handle_httpstatus_list': [302] to the meta of the source request, and check whether response.status == 302 in the callback. If it is, retry your request by yielding response.request.replace(dont_filter=True).

When retrying, you should also make your code limit the maximum number of retries for any given URL. You could keep a dictionary to track the retries:

from scrapy import Request, Spider


class MySpider(Spider):
    name = 'my_spider'

    max_retries = 2

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.retries = {}

    def start_requests(self):
        yield Request(
            'https://example.com',
            callback=self.parse,
            meta={
                'handle_httpstatus_list': [302],
            },
        )

    def parse(self, response):
        if response.status == 302:
            retries = self.retries.setdefault(response.url, 0)
            if retries < self.max_retries:
                self.retries[response.url] += 1
                yield response.request.replace(dont_filter=True)
            else:
                self.logger.error('%s still returns 302 responses after %s retries',
                                  response.url, retries)
            return

Depending on your scenario, you might want to move this code to a downloader middleware.
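A sketch of that middleware variant (`Retry302Middleware` and its retry limit are my own choices, not from the answer; the `process_response(request, response, spider)` signature is Scrapy's downloader-middleware API, and the middleware would need a priority above the built-in RedirectMiddleware's 600 so that it sees the 302 before the redirect is followed):

```python
class Retry302Middleware:
    """Retry unexpected 302 responses a bounded number of times."""

    MAX_RETRIES = 2

    def __init__(self):
        self.retries = {}  # url -> number of retries issued so far

    def process_response(self, request, response, spider):
        if response.status != 302:
            return response  # pass everything else through untouched
        retries = self.retries.setdefault(request.url, 0)
        if retries < self.MAX_RETRIES:
            self.retries[request.url] += 1
            # Re-issue the same request; dont_filter bypasses the dupe filter
            return request.replace(dont_filter=True)
        spider.logger.error('%s still returns 302 responses after %s retries',
                            request.url, retries)
        return response
```

It would then be registered in DOWNLOADER_MIDDLEWARES under a path of your own, e.g. `{'myproject.middlewares.Retry302Middleware': 620}` (path and number hypothetical).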

Answer 2 (score: 1):

> I added this redirect code to my middleware.py file, and added it to settings.py:

DOWNLOADER_MIDDLEWARES_BASE shows that RedirectMiddleware is already enabled by default, so what you did does not matter.
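If you genuinely want to replace the built-in middleware rather than run a second one alongside it, you have to disable the default entry explicitly by mapping its import path to None. A sketch (the `street.middlewares` path is the asker's own, and the `scrapy.contrib` path matches the Scrapy version used in the question):

```python
# settings.py -- disable the built-in RedirectMiddleware explicitly,
# then register the custom replacement in its place
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None,
    'street.middlewares.RedirectMiddleware': 100,
}
```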

> I want to send a request to the GET URL instead of being redirected.

How? The server responds with a 302 to your GET request. If you GET the same URL again, you will be redirected again.

What are you trying to achieve?

If you do not want to be redirected, see these questions:

Answer 3 (score: 1):

I had a problem with an infinite redirect loop when using HTTPCACHE_ENABLED = True. I managed to avoid the problem by setting HTTPCACHE_IGNORE_HTTP_CODES = [301, 302].
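Concretely, the two settings combined look like this in settings.py:

```python
# settings.py -- keep the HTTP cache, but never cache redirect responses,
# which is what caused the infinite redirect loop described above
HTTPCACHE_ENABLED = True
HTTPCACHE_IGNORE_HTTP_CODES = [301, 302]
```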

Answer 4 (score: 1):

I figured out how to bypass the redirect with the following approach:

1 - Check whether you got redirected in parse().

2 - If redirected, arrange to simulate the action that escapes this redirect and returns to your desired URL for scraping; you may need to inspect the network behaviour in Google Chrome and simulate a POST request to get back to your page.

3 - Go to another process: using a callback, do all the scraping work there through a recursive loop that calls itself, with a condition at the end to break the loop.

In the example below, I used this to bypass a disclaimer page and then return to my main URL and start scraping.

import scrapy
from scrapy.http import FormRequest
import requests


class ScrapeClass(scrapy.Spider):

    name = 'terrascan'

    page_number = 0

    start_urls = [
        Your MAin URL , Or list of your URLS, or Read URLs fro file to a list
    ]

    def parse(self, response):

        '''Here I killed the disclaimer page and continued in the proc below with follow!'''

        # Get the currently requested URL
        current_url = response.request.url

        # Get all followed redirect URLs
        redirect_url_list = response.request.meta.get('redirect_urls')
        # Get the first URL followed by the spider
        first_redirect_url = response.request.meta.get('redirect_urls')[0]

        # Handle redirection as below (check for redirection; got it from redirect.py
        # in the \downloadermiddlewares folder)
        allowed_status = (301, 302, 303, 307, 308)
        if 'Location' in response.headers or response.status in allowed_status:  # <== redirection condition

            print(current_url, '<========= am not redirected @@@@@@@@@@')
        else:

            print(current_url, '<====== kill that please %%%%%%%%%%%%%')

            session_requests = requests.session()

            # Got all the data below from monitoring the network behaviour in
            # Google Chrome while simulating a click on 'I Agree'
            headers_ = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
                        'ctl00$cphContent$btnAgree': 'I Agree'
                        }

            Post_ = session_requests.post(current_url, headers=headers_)

            print(response.url, '<========= check this please')

            return FormRequest.from_response(Post_, callback=self.parse_After_disclaimer)

    def parse_After_disclaimer(self, response):

        print(response.status)
        print(response.url)

        # Put your condition here to make sure that the current URL is what you
        # need; otherwise escape again until you kill the redirection
        if response.url not in [your list of URLs]:
            print('I am here brother')
            yield scrapy.Request(Your URL, callback=self.parse_After_disclaimer)
        else:
            # Here you are good to go with the scraping work
            items = TerrascanItem()

            all_td_tags = response.css('td')
            print(len(all_td_tags), 'all_td_results', response.url)

            parcel_No = all_td_tags.css('#ctl00_cphContent_ParcelOwnerInfo1_lbParcelNumber::text').extract()
            Owner_Name = all_td_tags.css('#ctl00_cphContent_ParcelOwnerInfo1_lbOwnerName::text').extract()

            if parcel_No:
                items['parcel_No'] = parcel_No
            else:
                items['parcel_No'] = ''

            yield items

        # Here you put the condition for the recursive call of this process again
        ScrapeClass.page_number += 1
        next_page = Your URLS[ScrapeClass.page_number]
        print('am in page #', ScrapeClass.page_number, '===', next_page)
        if ScrapeClass.page_number < len(ScrapeClass.start_urls_AfterDisclaimer) - 1:  # 20
            yield response.follow(next_page, callback=self.parse_After_disclaimer)

Answer 5 (score: 0):

You can disable the RedirectMiddleware by setting REDIRECT_ENABLED to False in settings.py.
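As a settings.py fragment (note this disables redirect handling for the whole project, unlike the per-request dont_redirect meta key from Answer 0):

```python
# settings.py -- turn off the RedirectMiddleware globally;
# redirects will no longer be followed automatically
REDIRECT_ENABLED = False
```

You may still need handle_httpstatus_list on your requests for the 3xx responses themselves to reach your callbacks, since non-2xx responses are otherwise filtered out.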