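I've written a Python script to fetch the titles of certain posts from different links on a web page. The problem is that the page I'm working with sometimes fails to return a valid response, but when I retry two or three times I do get a valid one.
I've been trying to build a loop so that the script checks whether the title I've defined is empty. If the title is empty, the script should keep retrying up to 4 times to see whether it succeeds. After the fourth attempt on a link, however, the script should move on to the next link and repeat the same process until all the links are exhausted.
This is my attempt so far: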
import time
import requests
from bs4 import BeautifulSoup

links = [
    "https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=2",
    "https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=3",
    "https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=4"
]
counter = 0

def fetch_data(link):
    global counter
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    try:
        title = soup.select_one("p.tcode").text
    except AttributeError: title = ""
    if not title:
        while counter<=4:
            time.sleep(1)
            print("trying {} times".format(counter))
            counter += 1
            fetch_data(link)
    else:
        counter = 0
    print("tried with this link:",link)

if __name__ == '__main__':
    for link in links:
        fetch_data(link)
This is the output I currently see in the console:
trying 0 times
trying 1 times
trying 2 times
trying 3 times
trying 4 times
tried with this link: https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=2
tried with this link: https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=2
tried with this link: https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=2
tried with this link: https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=2
tried with this link: https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=2
tried with this link: https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=2
tried with this link: https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=3
tried with this link: https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=4
My expected output:
trying 0 times
trying 1 times
trying 2 times
trying 3 times
trying 4 times
tried with this link: https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=2
trying 0 times
trying 1 times
trying 2 times
trying 3 times
trying 4 times
tried with this link: https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=3
trying 0 times
trying 1 times
trying 2 times
trying 3 times
trying 4 times
tried with this link: https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=4
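P.S. I deliberately used a wrong selector in my script so that it always meets the condition I've defined above.
How can I make my script retry each link several times when the condition is not met?
Answer 0 (score: 1)
I think you should rearrange your code as follows: reset counter before fetching each link, and print the link only after fetch_data has returned.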
import time
import requests
from bs4 import BeautifulSoup

links = [
    "https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=2",
    "https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=3",
    "https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=4"
]

def fetch_data(link):
    global counter  # set to 0 in the main loop before each link
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    try:
        title = soup.select_one("p.tcode").text
    except AttributeError:
        # select_one returned None because the selector matched nothing
        title = ""
    if not title:
        # Retry while the shared counter allows it; once it passes 4,
        # every pending loop in the recursion exits as well.
        while counter <= 4:
            time.sleep(1)
            print("trying {} times".format(counter))
            counter += 1
            fetch_data(link)

if __name__ == '__main__':
    for link in links:
        counter = 0  # fresh retry budget for each link
        fetch_data(link)
        print("tried with this link:", link)
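As a variation on the code above, and only a minimal sketch rather than a definitive fix, the same retry logic can be written with a plain for loop, which avoids both the global counter and the recursion. The names fetch_with_retries, get_title, max_attempts, delay and make_flaky_get_title are hypothetical and not from the original script. Here get_title stands in for whatever performs the requests.get and soup.select_one step and returns an empty string on failure; a fake, flaky version is used so the sketch runs without network access.

```python
import time


def fetch_with_retries(link, get_title, max_attempts=5, delay=1.0):
    """Call get_title(link) until it returns a non-empty title.

    Gives up after max_attempts calls and returns an empty string.
    """
    for attempt in range(max_attempts):
        title = get_title(link)
        if title:
            return title
        print("trying {} times".format(attempt))
        time.sleep(delay)
    return ""


def make_flaky_get_title(failures):
    """Hypothetical stand-in for the real request.

    The returned function yields "" for the first `failures` calls
    made with a given link, then a dummy title.
    """
    calls = {}

    def get_title(link):
        calls[link] = calls.get(link, 0) + 1
        return "" if calls[link] <= failures else "title of " + link

    return get_title


if __name__ == "__main__":
    get_title = make_flaky_get_title(failures=2)
    for link in ["page=2", "page=3"]:
        title = fetch_with_retries(link, get_title, delay=0)
        print("tried with this link:", link, "->", repr(title))
```

With max_attempts=5 and a title that never appears, this prints "trying 0 times" through "trying 4 times" for each link before moving on to the next one, because the attempt number lives in the loop and starts from zero on every call.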