I want to pull some information from a website, so I fetch its page and collect every link whose href contains a specific string. I then loop over those links and try to extract information from each linked page. But sometimes the BeautifulSoup call returns no results.
import requests
from bs4 import BeautifulSoup

url = 'https://www.website.com/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# collect every anchor that actually carries an href
results = soup.find_all('a', href=True)

# creating the list for all links found that contain the marker string
links = []
for result in results:
    if "speciallink" in result['href']:
        href = 'https://www.website.com' + result['href']
        links.append(href)

# drop duplicates
links = list(set(links))

for link in links:
    r = requests.get(link)
    soup = BeautifulSoup(r.text, 'html.parser')
    txt = 'text1'
    results = soup.find_all("div", {'class': txt})
    try:
        date = results[0].string
    except IndexError:
        print("error")
But when I rerun the loop against just the one URL that seems to return nothing, it usually does return a value, though sometimes it still doesn't:
for link in links:
    # the loop variable is ignored; every iteration hits the same problematic page
    url = 'https://www.website.com/page_with_error/'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    txt = 'text1'
    results = soup.find_all("div", {'class': txt})
    try:
        date = results[0].string
    except IndexError:
        print("error")
It is a static website. Could it be that the page has not finished loading?
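requests only downloads the raw HTML and does not execute any JavaScript, so for a truly static page there is nothing left to "finish loading" on the client side. A more likely explanation is that the server occasionally answers with an error or rate-limit page; if so, requests can retry such responses transparently via urllib3's Retry. A minimal sketch, where the retry count and status list are assumptions to adjust:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup

session = requests.Session()
retry = Retry(
    total=5,                                      # at most 5 retries per request
    backoff_factor=0.5,                           # exponential pause between attempts
    status_forcelist=[429, 500, 502, 503, 504],   # retry only on these statuses
)
session.mount('https://', HTTPAdapter(max_retries=retry))

r = session.get('https://www.website.com/page_with_error/')
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all("div", {'class': 'text1'})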
[Temporary solution]
I made a small fix that works around the problem, but I don't think it's a good solution. I simply use a while loop to keep requesting until I get the information I need:
import requests
from bs4 import BeautifulSoup

url = 'https://www.website.com/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('a', href=True)

# creating the list for all links found (same filtering as in the first snippet)
links = []
for result in results:
    if "speciallink" in result['href']:
        links.append('https://www.website.com' + result['href'])
links = list(set(links))

for link in links:
    url = 'https://www.website.com/page_with_error/'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    txt = 'text1'
    results = soup.find_all("div", {'class': txt})

    error = 0
    # new - I added this: re-request until the div shows up
    while len(results) == 0:
        error = error + 1
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        results = soup.find_all("div", {'class': txt})
        if error > 10000:
            print("got to break that")
            break
    # new - over

    try:
        date = results[0].string
    except IndexError:
        print("error")