Unable to break out of a loop when certain conditions are met

Asked: 2019-11-21 14:48:34

Tags: python python-3.x web-scraping

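I've written a Python script to fetch the first 400 links from Bing's search results. There is no guarantee that a search always returns at least 400 results; in this case there are only around 300. The landing page shows 10 results, and the rest can be reached by traversing the next pages. The problem is that once there is no further next-page link, the site keeps serving the last page of results over and over again.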

The search keyword is michael jackson, and this is the complete link

  

How can I break out of the loop when there are no more new results, or when there are fewer than 400 results?

I've tried with:

import time
import requests
from bs4 import BeautifulSoup

link = "https://www.bing.com/search?"

params = {'q': 'michael jackson','first': ''}

def get_bing_results(url):
    q = 1
    while q<=400:
        params['first'] = q
        res = requests.get(url,params=params,headers={
            "User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"
            })
        soup = BeautifulSoup(res.text,"lxml")
        for link in soup.select("#b_results h2 > a"):
            print(link.get("href"))

        time.sleep(2)
        q+=10

if __name__ == '__main__':
    get_bing_results(link)

1 answer:

Answer 0 (score: 2)

As I mentioned in the comments, couldn't you do something like this:

import time
import requests
from bs4 import BeautifulSoup

link = "https://www.bing.com/search?"

params = {'q': 'michael jackson','first': ''}

def get_bing_results(url):
    q = 1
    # HTML of the previously fetched page, used to detect a repeated page
    prev_soup = str()
    while q <= 400:
        params['first'] = q
        res = requests.get(url,params=params,headers={
            "User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"
            })
        soup = BeautifulSoup(res.text,"lxml")
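        # Only print links when this page differs from the previous one
        if str(soup) != prev_soup:
            for link in soup.select("#b_results h2 > a"):
                print(link.get("href"))
            prev_soup = str(soup)
        else:
            # The same page was served again: no more results, so stop
            break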
        time.sleep(2)
        q+=10

if __name__ == '__main__':
    get_bing_results(link)
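
Comparing the whole page as a string might never match if the HTML contains anything that changes between requests (an assumption; this was not verified against Bing). A variation on the same idea is to compare the extracted links instead and stop as soon as a page contributes nothing new. The sketch below is only an illustration: `collect_links`, `fetch_page`, and `demo_fetch` are hypothetical names that do not appear in the original code, and `demo_fetch` is a stand-in that imitates a site repeating its last page rather than a real request.

```python
from typing import Callable, List


def collect_links(fetch_page: Callable[[int], List[str]],
                  limit: int = 400,
                  page_size: int = 10) -> List[str]:
    """Collect up to `limit` unique links, stopping when a page adds nothing new.

    `fetch_page` is a hypothetical callable that takes the value of the
    `first` query parameter and returns the hrefs found on that page.
    """
    seen = set()
    collected = []
    first = 1
    while len(collected) < limit:
        added = False
        for url in fetch_page(first):
            if url not in seen:
                seen.add(url)
                collected.append(url)
                added = True
        if not added:
            # Empty page, or the last page served again: no more results.
            break
        first += page_size
    return collected[:limit]


def demo_fetch(first: int) -> List[str]:
    """Stand-in for a real request: 25 results, last page repeated forever."""
    total = 25
    start = min(first, 21)  # any offset past the end falls back to the last page
    return [f"https://example.com/{i}"
            for i in range(start, min(start + 10, total + 1))]


if __name__ == "__main__":
    for href in collect_links(demo_fetch):
        print(href)
```

In the real script, `fetch_page` would wrap the `requests.get` call and the `soup.select("#b_results h2 > a")` extraction shown above, together with the `time.sleep(2)` delay. Keeping the stop condition separate from the network code also makes it possible to check the loop logic without hitting the site.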