How can I follow a 302 redirect while still getting the page information when scraping with Scrapy?

Asked: 2019-08-04 19:56:36

Tags: python web-scraping scrapy http-status-code-302

I have been struggling to get past this 302 redirect. The point of this particular part of the scraper is to grab the next-page index so that I can flip through the pages. Direct URLs do not work with this site, so I cannot simply follow a "next" link or anything like that. In order to keep scraping the actual data with my parse_details function, I have to go through each page and simulate the requests.

This is all fairly new to me, so I made sure to try everything I could find first. I tried various settings ('REDIRECT_ENABLED': False, changing handle_httpstatus_list, etc.), but nothing got me past this. Currently I am trying to follow the Location of the redirect, but that is not working either. Here is an example of one potential solution I tried to follow.

try:
    print('Current page index: ', page_index)
except NameError:  # Raised if page_index wasn't found due to redirection.
    if response.status in (302,) and 'Location' in response.headers:
        # to_native_str comes from scrapy.utils.python.
        location = to_native_str(response.headers['Location'].decode('latin1'))
        yield scrapy.Request(response.urljoin(location), method='POST', callback=self.parse)
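
For reference, the settings-based attempts mentioned above looked roughly like this (just a sketch; REDIRECT_ENABLED and handle_httpstatus_list are real Scrapy settings, while the spider name is a placeholder):

import scrapy

class TournamentSpider(scrapy.Spider):
    name = 'tournaments'  # placeholder name
    # Disable the redirect middleware so 302 responses are not followed...
    custom_settings = {'REDIRECT_ENABLED': False}
    # ...and let the spider receive 302 responses itself instead of
    # HttpErrorMiddleware filtering them out.
    handle_httpstatus_list = [302]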

The code, without the detail parsing and such, is below:

def parse(self, response):
    table = response.css('td > a::attr(href)').extract()
    additional_page = response.css('span.page_list::text').extract()
    for string_item in additional_page:
        # The text has some non-breaking spaces (&nbsp;) to ignore. We want
        # the text representing the current page index only.
        for char in string_item:
            if char.isdigit():
                page_index = char
                break  # We have the current page index; back out of this loop.

    # Below is where the code breaks; page_index is never assigned since
    # the site is not reached for scraping after the redirection.
    try:
        print('Current page index: ', page_index)
    except NameError:  # See the redirect-following attempt above.
        return

    # To get to the next page, we submit a form request since it is all
    # set up with JavaScript instead of simply giving a URL to follow.
    # The event target has 'dgTournaments' information where the first
    # piece is always '_ctl1' and the second is '_ctl' followed by
    # the page index number we want to go to minus one (so if we want
    # to go to the 8th page, it's '_ctl7').
    # Thus we can just plug in the current page index, which is equal to
    # the next one we want to hit minus one.

    # Here is how I am making the requests; they work until the (302)
    # redirection...
    form_data = {"__EVENTTARGET": "dgTournaments:_ctl1:_ctl" + page_index,
                 "__EVENTARGUMENT": ";;AjaxControlToolkit, Version=3.5.50731.0, Culture=neutral, PublicKeyToken=28f01b0e84b6d53e:en-US:ec0bb675-3ec6-4135-8b02-a5c5783f45f5:de1feab2:f9cec9bc:35576c48"}

    yield FormRequest(current_LEVEL, formdata=form_data, method="POST", callback=self.parse, priority=2)

Alternatively, would the solution be to handle the pagination a different way instead of making all of these requests? The original link is

https://m.tennislink.usta.com/TournamentSearch/searchresults.aspx?typeofsubmit=&action=2&keywords=&tournamentid=&sectiondistrict=&city=&state=&zip=&month=0&startdate=&enddate=&day=&year=2019&division=G16&category=28&surface=&onlineentry=&drawssheets=&usertime=&sanctioned=-1&agegroup=Y&searchradius=-1

Any help would be appreciated.

1 Answer:

Answer 0 (score: 0)

You do not have to follow the 302 redirect. Instead, you can make the POST request yourself and receive the page details directly. The following code prints the data from the first 5 pages:

import requests
from bs4 import BeautifulSoup

url = 'https://m.tennislink.usta.com/TournamentSearch/searchresults.aspx'
params = {'year': '2019', 'division': 'G16', 'month': '0', 'searchradius': '-1'}
pages = 5

for i in range(pages):
    # Per the naming scheme in the question, '_ctl0' requests page 1,
    # '_ctl1' page 2, and so on.
    payload = {'__EVENTTARGET': 'dgTournaments:_ctl1:_ctl' + str(i)}

    res = requests.post(url, params=params, data=payload)
    soup = BeautifulSoup(res.content, 'lxml')

    table = soup.find('table', id='ctl00_mainContent_dgTournaments')

    # Pretty-print the table contents.
    for row in table.find_all('tr'):
        for column in row.find_all('td'):
            text = ', '.join(x.strip() for x in column.text.split('\n') if x.strip()).strip()
            print(text)
        print('-' * 10)
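
If you would rather stay inside Scrapy than switch to requests, the same idea translates to one FormRequest per page. This is only a sketch under the same assumptions about the form fields; the spider name and the yielded item shape are placeholders:

import scrapy
from scrapy import FormRequest

class TournamentSpider(scrapy.Spider):
    name = 'usta_tournaments'  # placeholder name
    start_urls = ['https://m.tennislink.usta.com/TournamentSearch/searchresults.aspx'
                  '?year=2019&division=G16&month=0&searchradius=-1']

    def parse(self, response):
        # POST once per page; the next-page index is computed here instead
        # of being read back from the (redirected) response.
        for i in range(5):
            yield FormRequest(
                response.url,
                formdata={'__EVENTTARGET': 'dgTournaments:_ctl1:_ctl' + str(i)},
                callback=self.parse_page,
                dont_filter=True,  # All page requests share the same URL.
            )

    def parse_page(self, response):
        for row in response.css('#ctl00_mainContent_dgTournaments tr'):
            cells = [t.strip() for t in row.css('td::text').getall() if t.strip()]
            if cells:
                yield {'cells': cells}  # placeholder item shape

Since the page index is generated locally rather than scraped from the response, the 302 never has to be followed at all.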