Web scraping with pagination using BeautifulSoup

Asked: 2018-04-12 00:34:09

Tags: python selenium selenium-webdriver web-scraping beautifulsoup

I am scraping data from Bodybuilding.com for a course project, and my goal is to scrape member information. I successfully scraped the information of the 20 members on the first page. The problem appears when I move to page 2: as the output shows, indexes 21 to 40 repeat the information from indexes 1 to 20, and I don't know why.

I thought the `metrics = soup.findAll(...)` line would update the variable and the information stored in it, but it doesn't seem to change. Is this related to the site's structure?

Thanks in advance for your help.

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.common.exceptions import NoSuchElementException
import time
import json

data = {}

browser = webdriver.Chrome()
url = "https://bodyspace.bodybuilding.com/member-search"
browser.get(url)

html = browser.page_source
soup = BeautifulSoup(html, "html.parser")

# Going through pagination
pages_remaining = True
counter = 1
index = 0

while pages_remaining:

    if counter == 60:
        pages_remaining = False

    # FETCH AGE, HEIGHT, WEIGHT, & FITNESS GOAL

    metrics = soup.findAll("div", {"class": "bbcHeadMetrics"})

    for x in range(0, len(metrics)):
        metrics_children = metrics[index].findChildren()

        details = soup.findAll("div", {"class": "bbcDetails"})
        individual_details = details[index].findChildren()

        if len(individual_details) > 16:
            print ("index: " + str(counter) + " / Age: " + individual_details[2].text + " / Height: " + individual_details[4].text + " / Weight: " + individual_details[7].text + " / Gender: " + individual_details[12].text + " / Goal: " + individual_details[18].text)
        else:
            print ("index: " + str(counter) + " / Age: " + individual_details[2].text + " / Height: " + individual_details[4].text + " / Weight: " + individual_details[7].text + " / Gender: " + individual_details[12].text + " / Goal: " + individual_details[15].text)

        index = index + 1
        counter = counter + 1

    try:
        # Go to page 2
        next_link = browser.find_element_by_xpath('//*[@title="Go to page 2"]')
        next_link.click()
        index = 0
        time.sleep(30)
    except NoSuchElementException:
        pages_remaining = False

1 Answer:

Answer 0 (score: 0)

You need to update the `html` and `soup` variables:

try:
    # Go to page 2
    next_link = browser.find_element_by_xpath('//*[@title="Go to page 2"]')
    next_link.click()
    index = 0

    # update html and soup
    html = browser.page_source
    soup = BeautifulSoup(html, "html.parser")

    time.sleep(30)

except NoSuchElementException:
    pages_remaining = False

I believe you have to do this because the URL doesn't change and the HTML is generated dynamically with JavaScript.
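The underlying point is that a `BeautifulSoup` object is a static snapshot of whatever string was passed to it: clicking a link in Selenium changes the browser's DOM, but not a parse tree built earlier. A minimal sketch of that behaviour, using two hard-coded HTML strings to stand in for `browser.page_source` before and after the click (the class name `bbcHeadMetrics` comes from the question; the member names are made up):

```python
from bs4 import BeautifulSoup

# Stand-ins for browser.page_source before and after clicking "Go to page 2".
page1_html = '<div class="bbcHeadMetrics">Member A</div>'
page2_html = '<div class="bbcHeadMetrics">Member B</div>'

soup = BeautifulSoup(page1_html, "html.parser")

# Simulate next_link.click(): the browser now shows page 2 ...
current_page_source = page2_html

# ... but the old soup still holds page 1's data.
stale = soup.find("div", {"class": "bbcHeadMetrics"}).text
print(stale)  # Member A

# Re-parsing the fresh page source is what picks up the new members.
soup = BeautifulSoup(current_page_source, "html.parser")
fresh = soup.find("div", {"class": "bbcHeadMetrics"}).text
print(fresh)  # Member B
```

As a side note, instead of the fixed `time.sleep(30)` you could wait explicitly for the new page's content with Selenium's `WebDriverWait` and `expected_conditions`, which is both faster and more reliable when page load times vary.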