Scraping all href links using pagination

Time: 2019-09-30 09:11:21

Tags: python-3.x web-scraping beautifulsoup

I have to select each state from https://www.maxpreps.com/search/states_by_sport.aspx?gendersport=boys,football&season=fall, then click on team rankings, and after that I have to grab the href link of each ranked team.

I've gotten as far as the team rankings part, and now I want to grab the link of every ranked team from all of the pages in the pagination bar. I only get the links of the teams on the first page, and I don't know how to navigate to the next pages. (Code below.)

import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

site = "https://www.maxpreps.com"

url = requests.get("https://www.maxpreps.com/search/states_by_sport.aspx?gendersport=boys,football&season=fall")
soup = BeautifulSoup(url.content, "html.parser")
states = soup.findAll('div', {'class': 'states'})
for each_state in states:
    all_states = each_state.find_all('a', href=True)
for a in all_states:
    domain = site + a['href']  # domain consists of the state links
    for r in domain:
        page_link = domain
        page_response = requests.get(page_link)
        soup = BeautifulSoup(page_response.content, "html.parser")
        for link in soup.findAll('a', attrs={'href': re.compile("rankings")}):
            rankings_link = site + link.get('href')
    # print(rankings_link)

for ert in rankings_link:
    team_link = rankings_link
    page_response1 = requests.get(team_link)
    soup = BeautifulSoup(page_response1.content, "html.parser")

    My_table = soup.find('table',{'class':'mx-grid sortable rankings-grid'})
    links = My_table.findAll('a')
print(links)


1 answer:

Answer 0: (score: 1)

You can page through the results just by changing the query parameters.
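For example, page 2 of a state's list is the same URL with a different page value (the ssid is the ranking list's id, the same one used in the code below; <state> stands for a state's code):

https://www.maxpreps.com/m/rankings/list.aspx?page=2&ssid=8d610ab9-220b-465b-9cf0-9f417bce6c65&state=<state>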

import requests 
from bs4 import BeautifulSoup

site = "https://www.maxpreps.com"

session = requests.Session()
response = session.get("https://www.maxpreps.com/search/states_by_sport.aspx?gendersport=boys,football&season=fall") 
soup = BeautifulSoup(response.content, "html.parser") 
all_states = soup.find('div', {'class': 'states'}) 

states_list = []
for each in all_states.find_all('a'):
    # the state code is the last query-string value in each link
    states_list.append(each['href'].split('=')[-1])
states_list = states_list[:-1]  # the last link in the div is not a state page, so drop it


team_links = []
url = 'https://www.maxpreps.com/m/rankings/list.aspx'
for state in states_list:
    break_loop = False
    page = 1
    while not break_loop:
        print('%s: Page %s' % (state, page))
        # the rankings list is paginated purely through these query parameters
        payload = {
                'page': str(page),
                'ssid': '8d610ab9-220b-465b-9cf0-9f417bce6c65',
                'state': state
                }

        response = requests.get(url, params=payload)
        soup = BeautifulSoup(response.text, "html.parser")
        table = soup.find('table')
        # once we run past the last page, the response has no rankings table
        if table is None:
            break_loop = True
        else:
            page += 1
            links = table.find_all('a')
            for link in links:
                team_links.append('https://www.maxpreps.com' + link['href'])
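As a side note, the requests.Session created at the top could just as well be reused inside the loop (session.get(url, params=payload)), which would recycle the underlying connection across the many page requests; a plain requests.get works too, it just opens a fresh connection each time.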

Output:

print (team_links[:10])
['https://www.maxpreps.com/m/high-schools/central-red-devils-(phenix-city,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/thompson-warriors-(alabaster,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/hoover-buccaneers-(hoover,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/oxford-yellow-jackets-(oxford,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/mountain-brook-spartans-(birmingham,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/hewitt-trussville-huskies-(trussville,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/mcgill-toolen-yellowjackets-(mobile,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/lee-generals-(montgomery,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/pinson-valley-indians-(pinson,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/vestavia-hills-rebels-(vestavia-hills,al)/football/default.htm']
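If a link ever gets collected twice, or you want to keep the results around, a minimal follow-up sketch (the filename is just an example, not part of the original answer):

# drop any duplicates while preserving the original order
team_links = list(dict.fromkeys(team_links))

# persist the links, one per line
with open('team_links.txt', 'w') as f:
    f.write('\n'.join(team_links))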