I have been trying hard to get around this 302 redirect. First, the point of this particular part of the scraper is to grab the next page index so that I can paginate. Direct URLs do not work for this site, so I cannot simply browse to the next page or anything like that. In order to keep scraping the actual data with the parse_details function, I have to walk through every page and simulate the requests.
This is still quite new to me, so I made sure to try everything I could find first. I tried various settings ("REDIRECT_ENABLED": False, changing handle_httpstatus_list, etc.), but nothing got me past this. At the moment I am trying to follow the redirect's Location header, but that does not work either. Here is an example of one potential solution I tried to follow:
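For reference, the settings combinations I tried look roughly like this. custom_settings and handle_httpstatus_list are standard Scrapy spider attributes; the class below is just a placeholder skeleton (a real spider would subclass scrapy.Spider):

```python
# Placeholder skeleton showing the redirect-related settings that were tried.
# In a real project this class subclasses scrapy.Spider.
class TournamentSpider:
    name = 'tournaments'

    # Stop Scrapy's RedirectMiddleware from following the 302 automatically.
    custom_settings = {'REDIRECT_ENABLED': False}

    # Let 302 responses reach the callback instead of being filtered out
    # by HttpErrorMiddleware.
    handle_httpstatus_list = [302]
```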
try:
    print('Current page index: ', page_index)
except NameError:  # Raised if page_index was never set due to the redirection.
    if response.status == 302 and 'Location' in response.headers:
        # to_native_str comes from scrapy.utils.python
        location = to_native_str(response.headers['Location'].decode('latin1'))
        yield scrapy.Request(response.urljoin(location), method='POST', callback=self.parse)
The code, without the detail parsing and so on, is below:
def parse(self, response):
    table = response.css('td > a::attr(href)').extract()
    additional_page = response.css('span.page_list::text').extract()

    for string_item in additional_page:
        # The text has some non-breaking spaces (&nbsp;) to ignore. We want
        # the text representing the current page index only.
        for char in string_item:
            if char.isdigit():
                page_index = char
                break  # We have the current page index; back out of this loop.

    # Below is where the code breaks; page_index is never set since we are
    # not getting to the site for scraping after the redirection.
    try:
        print('Current page index: ', page_index)
        # To get to the next page, we submit a form request since pagination
        # is all set up with JavaScript instead of simply giving a URL to
        # follow. The event target has 'dgTournament' information where the
        # first piece is always '_ctl1' and the second is '_ctl' followed by
        # the page index number we want to go to minus one (so to go to the
        # 8th page, it's '_ctl7'). Thus we can plug in the current page
        # index, which equals the next page we want to hit minus one.
        # Here is how I am making the requests; they work until the (302)
        # redirection...
        form_data = {
            "__EVENTTARGET": "dgTournaments:_ctl1:_ctl" + page_index,
            "__EVENTARGUMENT": ";;AjaxControlToolkit, Version=3.5.50731.0, Culture=neutral, PublicKeyToken=28f01b0e84b6d53e:en-US:ec0bb675-3ec6-4135-8b02-a5c5783f45f5:de1feab2:f9cec9bc:35576c48",
        }
        yield FormRequest(current_LEVEL, formdata=form_data, method="POST",
                          callback=self.parse, priority=2)
    except NameError:
        # page_index was never set; this is where the redirect would need to
        # be handled (see the attempt above).
        pass
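The event-target arithmetic described in the comments above can be isolated into a small helper to make the "page minus one" rule explicit. The name build_event_target is mine, not part of the original spider:

```python
def build_event_target(next_page):
    """Build the ASP.NET __EVENTTARGET value for a 1-based page number.

    The postback target uses a 0-based control index, so going to page N
    means '_ctl' followed by N - 1 (page 8 -> '_ctl7').
    """
    return 'dgTournaments:_ctl1:_ctl' + str(next_page - 1)

print(build_event_target(8))  # dgTournaments:_ctl1:_ctl7
```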
Alternatively, could the solution be to handle the pagination differently instead of making all of these requests? The original link is
https://m.tennislink.usta.com/TournamentSearch/searchresults.aspx?typeofsubmit=&action=2&keywords=&tournamentid=&sectiondistrict=&city=&state=&zip=&month=0&startdate=&enddate=&day=&year=2019&division=G16&category=28&surface=&onlineentry=&drawssheets=&usertime=&sanctioned=-1&agegroup=Y&searchradius=-1
Any help would be appreciated.
Answer (score: 0)
You do not have to follow the 302 redirect; instead, you can make the POST request directly and receive the page's details. The following code prints the data from the first 5 pages:
import requests
from bs4 import BeautifulSoup

url = 'https://m.tennislink.usta.com/TournamentSearch/searchresults.aspx'
pages = 5

for i in range(pages):
    params = {'year': '2019', 'division': 'G16', 'month': '0', 'searchradius': '-1'}
    payload = {'__EVENTTARGET': 'dgTournaments:_ctl1:_ctl' + str(i)}

    res = requests.post(url, params=params, data=payload)
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find('table', id='ctl00_mainContent_dgTournaments')

    # Pretty-print the table contents.
    for row in table.find_all('tr'):
        for column in row.find_all('td'):
            text = ', '.join(x.strip() for x in column.text.split('\n') if x.strip()).strip()
            print(text)
        print('-' * 10)
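Note that requests encodes the params dict into the URL's query string the same way the original search URL does, so the long URL in the question can be trimmed down to just the parameters that matter. A quick stdlib check (no requests needed) illustrates the encoding:

```python
from urllib.parse import urlencode

# The same params dict passed to requests.post above.
params = {'year': '2019', 'division': 'G16', 'month': '0', 'searchradius': '-1'}
query = urlencode(params)
print(query)  # year=2019&division=G16&month=0&searchradius=-1
```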