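I am scraping data from Bodybuilding.com for a course project, and my goal is to collect member information. I can successfully scrape the 20 members on the first page, but the problem starts when I move to page 2: the highlighted part below shows that indices 21 to 40 repeat the information from indices 1 to 20, and I don't know why.
I thought line 28 (in bold) would update the variable and the information it stores, but it does not seem to change. Is this related to the structure of the website?
Any help is appreciated. Thank you.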
    from selenium import webdriver
    from bs4 import BeautifulSoup
    from selenium.common.exceptions import NoSuchElementException
    import time
    import json

    data = {}
    browser = webdriver.Chrome()
    url = "https://bodyspace.bodybuilding.com/member-search"
    browser.get(url)
    html = browser.page_source
    soup = BeautifulSoup(html, "html.parser")

    # Going through pagination
    pages_remaining = True
    counter = 1
    index = 0

    while pages_remaining:

        if counter == 60:
            pages_remaining = False

        # FETCH AGE, HEIGHT, WEIGHT, & FITNESS GOAL
        **metrics = soup.findAll("div", {"class": "bbcHeadMetrics"})**

        for x in range(0, len(metrics)):
            metrics_children = metrics[index].findChildren()

            details = soup.findAll("div", {"class": "bbcDetails"})
            individual_details = details[index].findChildren()

            if len(individual_details) > 16:
                print ("index: " + str(counter) + " / Age: " + individual_details[2].text + " / Height: " + individual_details[4].text + " / Weight: " + individual_details[7].text + " / Gender: " + individual_details[12].text + " / Goal: " + individual_details[18].text)
            else:
                print ("index: " + str(counter) + " / Age: " + individual_details[2].text + " / Height: " + individual_details[4].text + " / Weight: " + individual_details[7].text + " / Gender: " + individual_details[12].text + " / Goal: " + individual_details[15].text)

            index = index + 1
            counter = counter + 1

        try:
            # Go to page 2
            next_link = browser.find_element_by_xpath('//*[@title="Go to page 2"]')
            next_link.click()
            index = 0
            time.sleep(30)
        except NoSuchElementException:
            rows_remaining = False
Answer 0 (score: 0)
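You need to update the variables html and soup after moving to the next page. In the question's code they are assigned only once, before the loop, so every iteration parses the source of the first page again.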
    try:
        # Go to page 2
        next_link = browser.find_element_by_xpath('//*[@title="Go to page 2"]')
        next_link.click()
        index = 0
        # give the JavaScript-rendered page time to load before re-reading it
        time.sleep(30)
        # update html and soup
        html = browser.page_source
        soup = BeautifulSoup(html, "html.parser")
    except NoSuchElementException:
        # no next-page link found: stop the while loop
        pages_remaining = False
I believe this is necessary because the URL does not change between pages and the HTML is generated dynamically with JavaScript.
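The same idea can be sketched as a loop that takes a fresh snapshot of the page on every pass instead of reusing one made before the loop. This is only an illustration under stated assumptions: scrape_all_pages, FakeBrowser, and the three callback names are hypothetical, and the fake browser stands in for Selenium so the example runs without a real driver.

```python
from typing import Callable, List


def scrape_all_pages(
    get_source: Callable[[], str],
    parse_page: Callable[[str], List[str]],
    go_to_next: Callable[[], bool],
    max_pages: int = 60,
) -> List[str]:
    """Collect rows from every page, re-reading the source each time.

    The three callbacks are hypothetical hooks, not Selenium API:
    get_source returns the current HTML, parse_page extracts rows from it,
    and go_to_next returns False when there is no further page.
    """
    rows: List[str] = []
    for _ in range(max_pages):
        html = get_source()            # fresh snapshot, taken inside the loop
        rows.extend(parse_page(html))  # parse the snapshot just taken
        if not go_to_next():           # stop when no next page exists
            break
    return rows


class FakeBrowser:
    """Stand-in for a real driver: page_source changes after next_page()."""

    def __init__(self, pages: List[str]) -> None:
        self._pages = pages
        self._current = 0

    @property
    def page_source(self) -> str:
        return self._pages[self._current]

    def next_page(self) -> bool:
        if self._current + 1 >= len(self._pages):
            return False
        self._current += 1
        return True


browser = FakeBrowser(["alice,bob", "carol,dave"])
members = scrape_all_pages(
    get_source=lambda: browser.page_source,
    parse_page=lambda html: html.split(","),
    go_to_next=browser.next_page,
)
print(members)  # ['alice', 'bob', 'carol', 'dave']
```

With Selenium, get_source would read browser.page_source, parse_page would build a new BeautifulSoup object from that HTML, and go_to_next would click the next-page link, wait for the new content to load, and return False when NoSuchElementException is raised.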