OK, I'm really frustrated now. I'm scraping data with Beautiful Soup, and the site has a structured format: the links look like https://www.brightscope.com/ratings/a, and the ratings continue through the rest of the alphabet. Each letter after ratings (e.g. a, b, c, ...) has multiple pages. I'm trying to create a while loop that goes to each page and, when a certain condition exists, scrapes all of the hrefs (I don't have that code yet). However, when I run the code, the while loop keeps running on and on. How can I fix it so that it goes to each page, searches for the condition, and, if the condition isn't found, moves on to the next letter? Before anyone asks, I've inspected the pages and don't see any li tags when it keeps running.

For example: https://www.brightscope.com/ratings/A/18 is the highest page for A, but it just keeps running.
import requests
from bs4 import BeautifulSoup

url = "https://www.brightscope.com/ratings/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

hrefs = []
ratings = []
ks = []
pages_scrape = []

for href in soup.findAll('a'):
    if 'href' in href.attrs:
        hrefs.append(href.attrs['href'])

for good_ratings in hrefs:
    if good_ratings.startswith('/ratings/'):
        ratings.append(url[:-9] + good_ratings)

del ratings[0]
del ratings[27:]

count = 1
# So it runs each letter a, b, c, ...
for each_rating in ratings:
    # Pulls the page
    page = requests.get(each_rating)
    # Does its soup thing
    soup = BeautifulSoup(page.text, 'html.parser')
    # Supposed to stay in A, B, C, ... until it can't find the 'li' tag
    while soup.find('li'):
        page = requests.get(each_rating + str(count))
        print(page.url)
        count = count + 1
        # Keeps running this and never breaks
    else:
        count = 1
        break
Answer 0 (score: 0)
soup.find('li') never changes. All you do inside the while loop is update the page and count variables. You need to make a new soup from the page variable; then it will change. Maybe something like this:
while soup.find('li'):
    page = requests.get(each_rating + str(count))
    soup = BeautifulSoup(page.text, 'html.parser')
    print(page.url)
    count = count + 1
    # The soup is now rebuilt each pass, so the loop can actually end
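If it helps, here's a slightly fuller sketch of the same idea that also resets count for each letter and collects the hrefs. It assumes the ratings list built earlier in your code, and the inner li/a selection is only a guess at the condition you want to scrape on, so swap in whatever check you actually need:

import requests
from bs4 import BeautifulSoup

pages_scrape = []

for each_rating in ratings:              # e.g. .../ratings/A/, .../ratings/B/, ...
    count = 1
    while True:
        page = requests.get(each_rating + str(count))
        soup = BeautifulSoup(page.text, 'html.parser')
        # Stop paging this letter as soon as a page has no 'li' tags,
        # then fall through to the next letter.
        if not soup.find('li'):
            break
        # Hypothetical scrape step: grab every href inside the list items.
        for li in soup.findAll('li'):
            for a in li.findAll('a', href=True):
                pages_scrape.append(a['href'])
        print(page.url)
        count += 1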
Hope this helps.
Answer 1 (score: 0)
BeautifulSoup's find() method finds only the first matching child. That means if you need to loop over all of the li elements, you need to use the findAll() method and iterate over its results.
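For example, here is a minimal sketch of the difference on a made-up snippet of HTML:

from bs4 import BeautifulSoup

html = "<ul><li>first</li><li>second</li><li>third</li></ul>"
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('li').text)       # only the first <li>: "first"

for li in soup.findAll('li'):     # every <li> in the document
    print(li.text)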