Scraping all href links using pagination

Time: 2019-09-30 09:11:21

Tags: python-3.x web-scraping beautifulsoup

I have to select each state from https://www.maxpreps.com/search/states_by_sport.aspx?gendersport=boys,football&season=fall, then click on team rankings, and after that I have to grab the href link of each ranked team.

I've gotten as far as the team rankings part, and now I want to grab the link of every ranked team from all of the pages in the pagination bar. I only get the links of the teams on the first page, and I don't know how to navigate to the next pages. (Code below.)

import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

site = "https://www.maxpreps.com"

url = requests.get("https://www.maxpreps.com/search/states_by_sport.aspx?gendersport=boys,football&season=fall")
soup = BeautifulSoup(url.content, "html.parser")
states = soup.findAll('div', {'class': 'states'})
for each_state in states:
    all_states = each_state.find_all('a', href=True)
for a in all_states:
    domain = site + a['href']  # domain consists of the state links
    for r in domain:
        page_link = domain
        page_response = requests.get(page_link)
        soup = BeautifulSoup(page_response.content, "html.parser")
        for link in soup.findAll('a', attrs={'href': re.compile("rankings")}):
            rankings_link = site + link.get('href')
    # print(rankings_link)

for ert in rankings_link:
    team_link = rankings_link
    page_response1 = requests.get(team_link)
    soup = BeautifulSoup(page_response1.content, "html.parser")

    My_table = soup.find('table',{'class':'mx-grid sortable rankings-grid'})
    links = My_table.findAll('a')
print(links)


1 answer:

Answer 0: (score: 1)

You can page through the results just by changing the query parameters.
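For example, page 2 of a state's list is the same URL with a different page value (the ssid is the ranking list's id, the same one used in the code below; <state> stands for a state's code):

https://www.maxpreps.com/m/rankings/list.aspx?page=2&ssid=8d610ab9-220b-465b-9cf0-9f417bce6c65&state=<state>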

import requests 
from bs4 import BeautifulSoup

site = "https://www.maxpreps.com"

session = requests.Session()
response = session.get("https://www.maxpreps.com/search/states_by_sport.aspx?gendersport=boys,football&season=fall") 
soup = BeautifulSoup(response.content, "html.parser") 
all_states = soup.find('div', {'class': 'states'}) 

states_list = []
for each in all_states.find_all('a'):
    # the state code is the last query-string value in each link
    states_list.append(each['href'].split('=')[-1])
states_list = states_list[:-1]  # the last link in the div is not a state page, so drop it


team_links = []
url = 'https://www.maxpreps.com/m/rankings/list.aspx'
for state in states_list:
    break_loop = False
    page = 1
    while not break_loop:
        print('%s: Page %s' % (state, page))
        # the rankings list is paginated purely through these query parameters
        payload = {
                'page': str(page),
                'ssid': '8d610ab9-220b-465b-9cf0-9f417bce6c65',
                'state': state
                }

        response = requests.get(url, params=payload)
        soup = BeautifulSoup(response.text, "html.parser")
        table = soup.find('table')
        # once we run past the last page, the response has no rankings table
        if table is None:
            break_loop = True
        else:
            page += 1
            links = table.find_all('a')
            for link in links:
                team_links.append('https://www.maxpreps.com' + link['href'])
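As a side note, the requests.Session created at the top could just as well be reused inside the loop (session.get(url, params=payload)), which would recycle the underlying connection across the many page requests; a plain requests.get works too, it just opens a fresh connection each time.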

Output:

print (team_links[:10])
['https://www.maxpreps.com/m/high-schools/central-red-devils-(phenix-city,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/thompson-warriors-(alabaster,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/hoover-buccaneers-(hoover,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/oxford-yellow-jackets-(oxford,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/mountain-brook-spartans-(birmingham,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/hewitt-trussville-huskies-(trussville,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/mcgill-toolen-yellowjackets-(mobile,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/lee-generals-(montgomery,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/pinson-valley-indians-(pinson,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/vestavia-hills-rebels-(vestavia-hills,al)/football/default.htm']
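If a link ever gets collected twice, or you want to keep the results around, a minimal follow-up sketch (the filename is just an example, not part of the original answer):

# drop any duplicates while preserving the original order
team_links = list(dict.fromkeys(team_links))

# persist the links, one per line
with open('team_links.txt', 'w') as f:
    f.write('\n'.join(team_links))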