Web scraping multiple pages with BeautifulSoup

Asked: 2019-10-22 17:04:09

Tags: python-3.x parsing beautifulsoup

I have collected all the necessary information from the first page, but I don't know how to collect it from all pages of the site. I tried to find a solution in other Stack Overflow threads, but found nothing. I would be grateful for any help.

The site I'm parsing: https://jaze.ru/forum/topic?id=50&page=1

Source:

from urllib.request import urlopen as uReq
from urllib.request import Request
from bs4 import BeautifulSoup as soup

# request with a browser User-Agent header to get past mod_security
my_url = Request('http://jaze.ru/forum/topic?id=50&page=1', headers={'User-Agent': 'Mozilla/5.0'})
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# html parsing
page_soup = soup(page_html, "html.parser")
# grab each player name
containers = page_soup.find_all("div", {"class": "top-area"})


for container in containers:
    playerName = container.div.a.text.strip()
    print("BattlePass PlayerName: " + playerName)

Source 2 (writing the names to a CSV file):

from urllib.request import urlopen as uReq
from urllib.request import Request
from bs4 import BeautifulSoup as soup

# open the csv file once, before the loop, so that each page
# appends to it instead of overwriting the previous pages
filename = "BattlePassNicknames.csv"
f = open(filename, "w", encoding="utf-8")
f.write("Member of JAZE Battle Pass 2019\n")

# start page
i = 1
while True:
    link = 'https://jaze.ru/forum/topic?id=50&page=' + str(i)
    my_url = Request(
        link,
        headers={'User-Agent': 'Mozilla/5.0'}
    )
    i += 1  # increment page no for next run
    uClient = uReq(my_url)
    # if we were redirected, the requested page does not exist -- stop
    if uClient.url != link:
        break
    page_html = uClient.read()
    uClient.close()
    # html parsing
    page_soup = soup(page_html, "html.parser")
    # grab each player name
    containers = page_soup.find_all("div", {"class": "top-area"})

    for container in containers:
        playerName = container.div.a.text.strip()
        print("BattlePass PlayerName: " + playerName)
        f.write(playerName + "\n")

f.close()

1 Answer:

Answer 0 (score: 0)

If the page query parameter is greater than the last available page, the site redirects you to another page. You can use that to keep incrementing page until you get redirected. This approach works if you already know the topic id (50 in this case).

from urllib.request import urlopen as uReq
from urllib.request import Request
from bs4 import BeautifulSoup as soup

# start page
i = 1
while True:
    link = 'https://jaze.ru/forum/topic?id=50&page='+str(i)
    my_url = Request(
        link,
        headers={'User-Agent': 'Mozilla/5.0'}
    )
    i += 1  # increment page no for next run
    uClient = uReq(my_url)
    # if we were redirected, the requested page does not exist -- stop
    if uClient.url != link:
        break
    page_html = uClient.read()
    uClient.close()
    # html parsing
    page_soup = soup(page_html, "html.parser")
    # grab each player name
    containers = page_soup.find_all("div", {"class": "top-area"})

    for container in containers:
        playerName = container.div.a.text.strip()
        print("BattlePass PlayerName: " + playerName)

Output

BattlePass PlayerName: VANTY3
BattlePass PlayerName: VANTY3
BattlePass PlayerName: KK#キング
BattlePass PlayerName: memories
BattlePass PlayerName: Waffel
BattlePass PlayerName: CynoBap
...
BattlePass PlayerName: Switchback

If you also want to try this with random topic ids, you will have to handle urllib.error.HTTPError somewhere in your code to deal with 404s and the like.
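
A minimal sketch of that error handling might look like this; the topic id 999999 is a made-up placeholder for an id that may not exist:

from urllib.request import urlopen as uReq
from urllib.request import Request
from urllib.error import HTTPError
from bs4 import BeautifulSoup as soup

# hypothetical topic id -- it may or may not exist on the forum
link = 'https://jaze.ru/forum/topic?id=999999&page=1'
my_url = Request(link, headers={'User-Agent': 'Mozilla/5.0'})

try:
    uClient = uReq(my_url)
except HTTPError as e:
    # e.g. a 404 for a topic id that does not exist
    print("Request failed with HTTP status", e.code)
else:
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")
    # ...continue parsing as in the loop above...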