Question

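I want to use BeautifulSoup to repeatedly retrieve the URL at a specific position. Imagine 4 different lists of URLs, each containing 100 different links.

I always need to fetch and print the 3rd URL in each list, where the previous URL (e.g. the 3rd URL in the first list) leads to the 2nd list (whose 3rd URL then needs to be fetched and printed, and so on until the 4th retrieval).

However, my loop only ever produces the first result (the 3rd URL in list 1), and I don't know how to feed the new URL back into the while loop to continue the process.

Here is my code: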

import urllib.request
import json
import ssl
from bs4 import BeautifulSoup


num=int(input('enter count times: ' ))
position=int(input('enter position: ' ))

url='https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
print (url)

count=0
order=0
while count<num:
    context = ssl._create_unverified_context()
    htm=urllib.request.urlopen(url, context=context).read()
    soup=BeautifulSoup(htm)
    for i in soup.find_all('a'):
        order+=1
        if order ==position:
            x=i.get('href')
            print (x)
    count+=1
    url=x        
print ('done')

Answer 1

This is a good problem for recursion. Try calling a recursive function to do this:

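import requests
from bs4 import BeautifulSoup


def retrieve_urls_recur(url, position, index, deepness):
    # Stop once the requested number of hops has been made.
    if index >= deepness:
        return True
    # requests.get() returns a Response object; the HTML is in its .text attribute.
    plain_text = requests.get(url).text
    soup = BeautifulSoup(plain_text, 'html.parser')
    links = soup.find_all('a')
    # position is a 0-based index into the list of links.
    desired_link = links[position].get('href')
    print(desired_link)
    return retrieve_urls_recur(desired_link, position, index + 1, deepness)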

Then call it with the desired arguments, in your case:

retrieve_urls_recur(url, 2, 0, 4)

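2 is the (0-based) index of the URL in the list of links, so it selects the 3rd link; 0 is the counter; and 4 is how deep you want to recurse.

P.S. I am using requests instead of urllib, and I have not tested this, though I have used a very similar function with success.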
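Since the function above was posted untested, here is a minimal offline sketch of the same recursive idea that can be run without network access. It rests on several assumptions that are not in the original post: the `PAGES` dict stands in for the web, `fetch` replaces `requests.get(url).text`, and the standard-library `HTMLParser` stands in for `soup.find_all('a')` so that no third-party package is needed. All names in it (`PAGES`, `LinkCollector`, `fetch`, `follow_links`) are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "web": each page holds three links.
PAGES = {
    "page_a": '<a href="page_x">x</a><a href="page_y">y</a><a href="page_b">b</a>',
    "page_b": '<a href="page_x">x</a><a href="page_y">y</a><a href="page_c">c</a>',
    "page_c": '<a href="page_x">x</a><a href="page_y">y</a><a href="page_d">d</a>',
}


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def fetch(url):
    # Stand-in for requests.get(url).text (assumption for this sketch).
    return PAGES[url]


def follow_links(url, position, index, deepness, visited=None):
    """Recursively follow the link at 0-based `position`, `deepness` times."""
    if visited is None:
        visited = []
    if index >= deepness:
        return visited
    parser = LinkCollector()
    parser.feed(fetch(url))
    next_url = parser.hrefs[position]
    visited.append(next_url)
    # Pass `position` along unchanged; only the counter advances.
    return follow_links(next_url, position, index + 1, deepness, visited)


print(follow_links("page_a", 2, 0, 3))  # ['page_b', 'page_c', 'page_d']
```

Returning the list of visited URLs instead of `True` is a design choice of this sketch; it makes the chain of hops easy to inspect or test.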

Answer 2

Just get the link by index from the result of find_all():

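while count < num:
    context = ssl._create_unverified_context()
    htm = urllib.request.urlopen(url, context=context).read()

    soup = BeautifulSoup(htm, 'html.parser')
    # position is entered as a 1-based number, so subtract 1 for the list index
    url = soup.find_all('a')[position - 1].get('href')
    print(url)

    count += 1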
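One caveat with indexing directly: the lookup raises `IndexError` if a page has fewer links than expected, and an `href` may be relative, in which case `urlopen` cannot open it as-is. Below is a minimal sketch of a helper that handles both cases, assuming a 1-based `position` as in the question. `pick_link` is a hypothetical name, the `example.com` URLs are placeholders, and `hrefs` would be something like `[a.get('href') for a in soup.find_all('a')]`.

```python
from urllib.parse import urljoin


def pick_link(hrefs, position, base_url):
    """Return the absolute URL of the link at 1-based `position`, or None."""
    # Guard against pages with fewer links than expected.
    if not 1 <= position <= len(hrefs):
        return None
    href = hrefs[position - 1]
    # a.get('href') returns None for anchors without an href attribute.
    if not href:
        return None
    # urljoin leaves absolute URLs untouched and resolves relative ones.
    return urljoin(base_url, href)


base = "https://example.com/data/start.html"  # placeholder URL
print(pick_link(["a.html", "b.html", "c.html"], 3, base))
print(pick_link(["a.html"], 3, base))
```

Inside the loop, the result could be checked with `if url is None: break` so that a short or malformed page ends the crawl cleanly rather than with a traceback.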

Loop and retrieve a specific URL with BeautifulSoup
