BeautifulSoup find_all randomly does not return values

Asked: 2019-02-08 11:24:33

Tags: python beautifulsoup python-requests

I want to collect some information from a website. To do that, I fetch its front page and gather every link whose href contains a particular string. I then loop over those links and try to extract information from each linked page. But sometimes the BeautifulSoup find_all call returns nothing.

import requests
from bs4 import BeautifulSoup

url = 'https://www.website.com/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('a')

# creating the list for all links found
links = []

for result in results:
    # use .get() so <a> tags without an href attribute don't raise a KeyError
    href = result.get('href', '')
    if "speciallink" in href:
        links.append('https://www.website.com' + href)

links = list(set(links))

for link in links:
    r = requests.get(link)
    soup = BeautifulSoup(r.text, 'html.parser')
    txt = 'text1'
    results = soup.find_all('div', {'class': txt})
    try:
        date = results[0].string
    except IndexError:
        print("error")

But when I rerun the loop using only a URL that seemed to return no value, it returns a value most of the time, yet sometimes it still does not:

for link in links:
    url = 'https://www.website.com/page_with_error/'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    txt = 'text1'
    results = soup.find_all('div', {'class': txt})
    try:
        date = results[0].string
    except IndexError:
        print("error")

This is a static website. Could it be that the page has not finished loading?
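Since requests downloads the raw HTML in one shot and runs no JavaScript, "not finished loading" should not apply to a static site; an intermittent server error or rate-limit page is a more likely cause. A quick diagnostic sketch, assuming the same placeholder URL and class name as above:

import requests
from bs4 import BeautifulSoup

url = 'https://www.website.com/page_with_error/'  # placeholder URL from the question
r = requests.get(url)

# If find_all comes back empty, first check whether the server actually
# returned the expected page or an error/rate-limit page instead.
print(r.status_code)   # anything other than 200 would explain the missing div
print(len(r.text))     # an unusually short body also points to an error page

soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('div', {'class': 'text1'})
if not results:
    # dump the start of the body for inspection when the div is missing
    print(r.text[:500])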

[Temporary workaround]

I made a small fix that helps with the problem, but I don't think it is a good solution. I simply use a while loop until I get the information I need:

import requests
from bs4 import BeautifulSoup

url = 'https://www.website.com/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('a')

# creating the list for all links found
links = []
for result in results:
    href = result.get('href', '')
    if "speciallink" in href:
        links.append('https://www.website.com' + href)
links = list(set(links))

for link in links:
    r = requests.get(link)
    soup = BeautifulSoup(r.text, 'html.parser')
    txt = 'text1'
    results = soup.find_all('div', {'class': txt})
    error = 0
    # new - I added this: keep re-requesting until find_all returns something
    while len(results) == 0:
        error += 1
        r = requests.get(link)
        soup = BeautifulSoup(r.text, 'html.parser')
        results = soup.find_all('div', {'class': txt})
        if error > 10000:
            print("got to break that")
            break
    # new - over
    try:
        date = results[0].string
    except IndexError:
        print("error")

0 Answers:

There are no answers yet.