OK, I'm really frustrated now. I'm scraping data with Beautiful Soup, and the site has a structured format: the links look like https://www.brightscope.com/ratings/a, and the ratings continue through the rest of the alphabet. Each letter after ratings (e.g. a, b, c, ...) has multiple pages. I'm trying to create a while loop that goes to each page and, when a certain condition exists, scrapes all of the hrefs (I don't have that code yet). However, when I run the code, the while loop keeps running on and on. How can I fix it so that it goes to each page, searches for the condition, and, if the condition isn't found, moves on to the next letter? Before anyone asks, I've inspected the pages and don't see any li tags when it keeps running.

For example: https://www.brightscope.com/ratings/A/18 is the highest page for A, but it just keeps running.
import requests
from bs4 import BeautifulSoup

url = "https://www.brightscope.com/ratings/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

hrefs = []
ratings = []
ks = []
pages_scrape = []

for href in soup.findAll('a'):
    if 'href' in href.attrs:
        hrefs.append(href.attrs['href'])

for good_ratings in hrefs:
    if good_ratings.startswith('/ratings/'):
        ratings.append(url[:-9] + good_ratings)

del ratings[0]
del ratings[27:]

count = 1
# So it runs each letter a, b, c, ...
for each_rating in ratings:
    # Pulls the page
    page = requests.get(each_rating)
    # Does its soup thing
    soup = BeautifulSoup(page.text, 'html.parser')
    # Supposed to stay in A, B, C, ... until it can't find the 'li' tag
    while soup.find('li'):
        page = requests.get(each_rating + str(count))
        print(page.url)
        count = count + 1
        # Keeps running this and never breaks
    else:
        count = 1
        break
Answer 0 (score: 0)
soup.find('li') never changes. All you do inside the while loop is update the page and count variables. You need to make a new soup from the page variable; then it will change. Maybe something like this:
while soup.find('li'):
    page = requests.get(each_rating + str(count))
    soup = BeautifulSoup(page.text, 'html.parser')
    print(page.url)
    count = count + 1
    # The soup is now rebuilt each pass, so the loop can actually end
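If it helps, here's a slightly fuller sketch of the same idea that also resets count for each letter and collects the hrefs. It assumes the ratings list built earlier in your code, and the inner li/a selection is only a guess at the condition you want to scrape on, so swap in whatever check you actually need:

import requests
from bs4 import BeautifulSoup

pages_scrape = []

for each_rating in ratings:              # e.g. .../ratings/A/, .../ratings/B/, ...
    count = 1
    while True:
        page = requests.get(each_rating + str(count))
        soup = BeautifulSoup(page.text, 'html.parser')
        # Stop paging this letter as soon as a page has no 'li' tags,
        # then fall through to the next letter.
        if not soup.find('li'):
            break
        # Hypothetical scrape step: grab every href inside the list items.
        for li in soup.findAll('li'):
            for a in li.findAll('a', href=True):
                pages_scrape.append(a['href'])
        print(page.url)
        count += 1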
Hope this helps.
Answer 1 (score: 0)
BeautifulSoup's find() method finds only the first matching child. That means if you need to loop over all of the li elements, you need to use the findAll() method and iterate over its results.
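For example, here is a minimal sketch of the difference on a made-up snippet of HTML:

from bs4 import BeautifulSoup

html = "<ul><li>first</li><li>second</li><li>third</li></ul>"
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('li').text)       # only the first <li>: "first"

for li in soup.findAll('li'):     # every <li> in the document
    print(li.text)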