Multiple requests cause the program to crash (using BeautifulSoup)

Date: 2021-06-15 19:35:53

Tags: python beautifulsoup python-requests crash

I'm writing a Python program that lets the user enter several websites, then requests each one, scrapes its title, and prints it. However, once the list goes past 8 websites, the program crashes every time. I'm not sure whether it's a memory issue; I've been searching and can't find anyone who has hit the same problem. The code is below (I've included a list of 9 sites, so you can just copy and paste the code to see the issue).

import requests
from bs4 import BeautifulSoup
lst = ['https://covid19tracker.ca/provincevac.html?p=ON', 'https://www.ontario.ca/page/reopening-ontario#foot-1', 'https://blog.twitter.com/en_us/topics/company/2020/keeping-our-employees-and-partners-safe-during-coronavirus.html', 'https://www.aboutamazon.com/news/company-news/amazons-covid-19-blog-updates-on-how-were-responding-to-the-crisis#covid-latest', 'https://www.bcg.com/en-us/publications/2021/advantages-of-remote-work-flexibility', 'https://news.prudential.com/increasingly-workers-expect-pandemic-workplace-adaptations-to-stick.htm', 'https://www.mckinsey.com/featured-insights/future-of-work/whats-next-for-remote-work-an-analysis-of-2000-tasks-800-jobs-and-nine-countries', 'https://www.gsb.stanford.edu/faculty-research/publications/does-working-home-work-evidence-chinese-experiment', 'https://www.livecareer.com/resources/careers/planning/is-remote-work-here-to-stay']
for websites in range(len(lst)):
    url=lst[websites]
    cite = requests.get(url,timeout=10).content
    soup = BeautifulSoup(cite,'html.parser')
    title = soup.find('title').get_text().strip()
    print(title)
print("Didn't crash")

The second website has no title, but don't worry about that.

1 Answer:

Answer 0 (score: 1)

To avoid the crash, pass a user-agent header via the headers= parameter of requests.get(); otherwise the site may decide you are a bot and block you.

cite = requests.get(url, headers=headers, timeout=10).content

In your case:

import requests
from bs4 import BeautifulSoup


headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36"
}

lst = [
    "https://covid19tracker.ca/provincevac.html?p=ON",
    "https://www.ontario.ca/page/reopening-ontario#foot-1",
    "https://blog.twitter.com/en_us/topics/company/2020/keeping-our-employees-and-partners-safe-during-coronavirus.html",
    "https://www.aboutamazon.com/news/company-news/amazons-covid-19-blog-updates-on-how-were-responding-to-the-crisis#covid-latest",
    "https://www.bcg.com/en-us/publications/2021/advantages-of-remote-work-flexibility",
    "https://news.prudential.com/increasingly-workers-expect-pandemic-workplace-adaptations-to-stick.htm",
    "https://www.mckinsey.com/featured-insights/future-of-work/whats-next-for-remote-work-an-analysis-of-2000-tasks-800-jobs-and-nine-countries",
    "https://www.gsb.stanford.edu/faculty-research/publications/does-working-home-work-evidence-chinese-experiment",
    "https://www.livecareer.com/resources/careers/planning/is-remote-work-here-to-stay",
]
for url in lst:
    cite = requests.get(url, headers=headers, timeout=10).content
    soup = BeautifulSoup(cite, "html.parser")
    # Guard against pages with no <title> tag (like your second site)
    title_tag = soup.find("title")
    title = title_tag.get_text().strip() if title_tag else "(no title)"
    print(title)
print("Didn't crash")
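
Since you mentioned one of the pages has no `<title>` tag, it can help to isolate the title extraction in a small helper that tolerates a missing tag, so the loop never raises `AttributeError` on `None`. A minimal sketch (the function name `extract_title` and the fallback string are my own choices, not part of your code):

```python
from bs4 import BeautifulSoup


def extract_title(html, fallback="(no title)"):
    """Return the stripped <title> text, or a fallback when the tag is absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("title")
    return tag.get_text().strip() if tag else fallback


# A page with a title and one without:
print(extract_title("<html><head><title> Hello </title></head></html>"))  # Hello
print(extract_title("<html><body>No head here</body></html>"))  # (no title)
```

You can then call `extract_title(requests.get(url, headers=headers, timeout=10).content)` inside the loop instead of chaining `.find("title").get_text()` directly.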