I have to select each state from https://www.maxpreps.com/search/states_by_sport.aspx?gendersport=boys,football&season=fall, then click on Team Rankings, and then grab the href link of each ranked team.
I have it working up to the team rankings part; now I want to collect the link of every ranked team from all the pages in the pagination bar. I only get the links of the teams on the first page, and I don't know how to navigate to the next page. (Code below.)
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

site = "https://www.maxpreps.com"

url = requests.get("https://www.maxpreps.com/search/states_by_sport.aspx?gendersport=boys,football&season=fall")
soup = BeautifulSoup(url.content, "html.parser")
states = soup.findAll('div', {'class': 'states'})
for each_state in states:
    all_states = each_state.find_all('a', href=True)
    for a in all_states:
        domain = site + a['href']  # domain consists of the state links
        for r in domain:
            page_link = domain
            page_response = requests.get(page_link)
            soup = BeautifulSoup(page_response.content, "html.parser")
            for link in soup.findAll('a', attrs={'href': re.compile("rankings")}):
                rankings_link = site + link.get('href')
                # print(rankings_link)
                for ert in rankings_link:
                    team_link = rankings_link
                    page_response1 = requests.get(team_link)
                    soup = BeautifulSoup(page_response1.content, "html.parser")
                    My_table = soup.find('table', {'class': 'mx-grid sortable rankings-grid'})
                    links = My_table.findAll('a')
                    print(links)
Answer (score: 1)
You can page through the rankings simply by changing the query parameters.
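For example (an illustrative URL, assuming the state abbreviation al for Alabama from the state hrefs, and the ssid value used in the code below, which appears to identify the boys fall football rankings), page 2 of a state's list would be:

https://www.maxpreps.com/m/rankings/list.aspx?page=2&ssid=8d610ab9-220b-465b-9cf0-9f417bce6c65&state=al

Incrementing page until a response no longer contains a rankings table covers every page in the pagination bar.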
import requests
from bs4 import BeautifulSoup
site = "https://www.maxpreps.com"
session = requests.Session()
response = session.get("https://www.maxpreps.com/search/states_by_sport.aspx?gendersport=boys,football&season=fall")
soup = BeautifulSoup(response.content, "html.parser")
all_states = soup.find('div', {'class': 'states'})
states_list = []
for each in all_states.find_all('a'):
    states_list.append(each['href'].split('=')[-1])
states_list = states_list[:-1]
team_links = []
url = 'https://www.maxpreps.com/m/rankings/list.aspx'
for state in states_list:
    break_loop = False
    page = 1
    while not break_loop:
        print('%s: Page %s' % (state, page))
        payload = {
            'page': str(page),
            'ssid': '8d610ab9-220b-465b-9cf0-9f417bce6c65',
            'state': state
        }
        response = session.get(url, params=payload)  # reuse the session opened above
        soup = BeautifulSoup(response.text, "html.parser")
        table = soup.find('table')
        if table is None:
            # a page past the last one returns no rankings table
            break_loop = True
        else:
            page += 1
            links = table.find_all('a')
            for link in links:
                team_links.append('https://www.maxpreps.com' + link['href'])
Output:
print(team_links[:10])
['https://www.maxpreps.com/m/high-schools/central-red-devils-(phenix-city,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/thompson-warriors-(alabaster,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/hoover-buccaneers-(hoover,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/oxford-yellow-jackets-(oxford,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/mountain-brook-spartans-(birmingham,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/hewitt-trussville-huskies-(trussville,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/mcgill-toolen-yellowjackets-(mobile,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/lee-generals-(montgomery,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/pinson-valley-indians-(pinson,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/vestavia-hills-rebels-(vestavia-hills,al)/football/default.htm']
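The stop condition works because requesting a page past the last one comes back without a table element, so you never need to know the page count in advance. If you want to keep the collected links, a minimal sketch (assuming the team_links list built above) writes them to a text file:

with open('team_links.txt', 'w') as f:
    for link in team_links:
        f.write(link + '\n')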