How do I get text from the next page using BeautifulSoup in Python 3?

Asked: 2016-06-13 15:40:47

Tags: python python-3.x web-scraping beautifulsoup html-parsing

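I'm trying to get all the match results for a team from every page of its match history. So far I can get each opponent 1 vs. opponent 2 pairing together with both scores, but I don't know how to get to the next page to fetch the rest of the data. Should I find the next-page link and put it in a while loop? Here is the link to the team I want: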

http://www.gosugamers.net/counterstrike/teams/7397-natus-vincere/matches

This is what I have so far. It gets the team's matches and scores, but only from the first page.

import requests
from bs4 import BeautifulSoup

def all_match_outcomes():
    # match_history_url() and rest_server() are helper functions not shown here
    for match_outcomes in match_history_url():
        rest_server(True)
        page = requests.get(match_outcomes).content
        soup = BeautifulSoup(page, 'html.parser')

        # Team name from the page header, without the "- Team Overview" suffix
        team_name_element = soup.select_one('div.teamNameHolder')
        team_name = team_name_element.find('h1').text.replace('- Team Overview', '')

        # One table row per match
        for match_outcome in soup.select('table.simple.gamelist.profilelist tr'):
            opp1 = match_outcome.find('span', {'class': 'opp1'}).text
            opp2 = match_outcome.find('span', {'class': 'opp2'}).text

            opp1_score = match_outcome.find('span', {'class': 'hscore'}).text
            opp2_score = match_outcome.find('span', {'class': 'ascore'}).text

            if match_outcome(True):  # If teams have past matches
                print(team_name, '%s %s:%s %s' % (opp1, opp1_score, opp2_score, opp2))

1 Answer:

Answer 0 (score: 0)

Get the number of the last page, then iterate page by page until you reach it.
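As a minimal sketch of that idea, the two steps (reading the last page number out of a link, then building one URL per page) can be isolated into small functions that need no network access. The helper names `last_page_number` and `page_urls` are made up for illustration, and the `page=N` query format is an assumption about how the site paginates.

```python
import re


def last_page_number(last_page_href):
    """Return the page number found in an href like '...?page=12'.

    Falls back to 1 when the href carries no page parameter
    (assumed here to mean that there is only a single page).
    """
    match = re.search(r"[?&]page=(\d+)", last_page_href)
    return int(match.group(1)) if match else 1


def page_urls(base_url, last_page):
    """Build the URL of every page from 1 to last_page inclusive.

    Page 1 is the plain base URL; later pages append ?page=N.
    """
    return [
        base_url if number == 1 else "%s?page=%d" % (base_url, number)
        for number in range(1, last_page + 1)
    ]


if __name__ == "__main__":
    base = "http://www.gosugamers.net/counterstrike/teams/7397-natus-vincere/matches"
    # The href below is a hypothetical example, not copied from the real page.
    for url in page_urls(base, last_page_number("/matches?page=3")):
        print(url)
```

Keeping the URL logic separate from the HTTP requests makes it easy to check on its own before any page is downloaded.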

Complete working code:

import re

import requests
from bs4 import BeautifulSoup

url = "http://www.gosugamers.net/counterstrike/teams/7397-natus-vincere/matches"

with requests.Session() as session:
    response = session.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # locate the last page link
    last_page_link = soup.find("span", text="Last").parent["href"]
    # extract the last page number
    last_page_number = int(re.search(r"page=(\d+)$", last_page_link).group(1))

    print("Processing page number 1")
    # TODO: extract data

    # iterate over the remaining pages, starting from page 2 (page 1 was already fetched above)
    for page_number in range(2, last_page_number+1):
        print("Processing page number %d" % page_number)

        link = "http://www.gosugamers.net/counterstrike/teams/7397-natus-vincere/matches?page=%d" % page_number
        response = session.get(link)

        soup = BeautifulSoup(response.content, "html.parser")

        # TODO: extract data
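The `# TODO: extract data` steps could be filled in with something along these lines. This is only a sketch: it reuses the selectors from the question (`table.simple.gamelist.profilelist`, `opp1`, `opp2`, `hscore`, `ascore`), the function name `parse_matches` is hypothetical, and the sample HTML (including "Team B") is invented to show the shape of the data rather than copied from the site. Rows that lack any of the four spans, such as header rows, are skipped instead of raising `AttributeError`.

```python
from bs4 import BeautifulSoup


def parse_matches(html):
    """Return a list of (opp1, opp1_score, opp2_score, opp2) tuples for one page."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.select("table.simple.gamelist.profilelist tr"):
        # Look up the four spans that describe a single match
        spans = [
            row.find("span", class_=name)
            for name in ("opp1", "hscore", "ascore", "opp2")
        ]
        # Skip header rows or rows for matches without a result
        if any(span is None for span in spans):
            continue
        results.append(tuple(span.get_text(strip=True) for span in spans))
    return results


# Invented sample markup that mimics the structure assumed above
SAMPLE_HTML = """
<table class="simple gamelist profilelist">
  <tr><th>Match</th></tr>
  <tr><td>
    <span class="opp1">Natus Vincere</span>
    <span class="hscore">2</span>
    <span class="ascore">1</span>
    <span class="opp2">Team B</span>
  </td></tr>
</table>
"""

if __name__ == "__main__":
    for opp1, opp1_score, opp2_score, opp2 in parse_matches(SAMPLE_HTML):
        print("%s %s:%s %s" % (opp1, opp1_score, opp2_score, opp2))
```

Inside the loop above, calling `parse_matches(response.content)` once for page 1 and once per later page would collect the results from every page.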