Web scraping with Python sometimes returns no results

Date: 2019-01-14 11:30:43

Tags: python html beautifulsoup

I'm trying to scrape a Reddit page for its video, using Python and BeautifulSoup. The code below sometimes returns results and sometimes returns nothing when I rerun it. I'm not sure where I'm going wrong. Can anyone help? I'm new to Python, so please bear with me.

import requests
from bs4 import BeautifulSoup


page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/')

soup = BeautifulSoup(page.text, 'html.parser')

source_tags = soup.find_all('source')

print(source_tags)

2 answers:

Answer 0 (score: 1)

If you add print(page) after page = requests.get('https:/.........'), you'll see that you get a successful <Response [200]>.

However, if you run it again too quickly, you'll get <Response [429]>.

"The HTTP 429 Too Many Requests response status code indicates the user has sent too many requests in a given amount of time ('rate limiting')." Source: here

Also, if you look at the HTML source, you'll see:

<h1>whoa there, pardner!</h1>
<p>we're sorry, but you appear to be a bot and we've seen too many requests
from you lately. we enforce a hard speed limit on requests that appear to come
from bots to prevent abuse.</p>
<p>if you are not a bot but are spoofing one via your browser's user agent
string: please change your user agent string to avoid seeing this message
again.</p>
<p>please wait 6 second(s) and try again.</p>
<p>as a reminder to developers, we recommend that clients make no
    more than <a href="http://github.com/reddit/reddit/wiki/API">one
    request every two seconds</a> to avoid seeing this message.</p>
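Given Reddit's "one request every two seconds" guidance above, a request loop could pause and retry whenever a 429 comes back. This is a minimal sketch with a hypothetical helper (fetch_with_backoff is not part of requests or the original answer):

```python
import time

def fetch_with_backoff(fetch, max_retries=3, delay=2.0):
    """Call fetch() and, whenever it reports HTTP 429, wait before retrying."""
    response = fetch()
    for _ in range(max_retries):
        if response.status_code != 429:
            break
        time.sleep(delay)  # respect Reddit's ~1 request / 2 seconds guidance
        response = fetch()
    return response
```

It would be called as fetch_with_backoff(lambda: requests.get(url, headers=headers)), returning the last response whether or not it ever succeeded.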

To add a User-Agent header and avoid the 429:

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"}

page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/', headers=headers)

Full code:

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"}

page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/', headers=headers)
print(page)

soup = BeautifulSoup(page.text, 'html.parser')

source_tags = soup.find_all('source')

print(source_tags)

Output:

<Response [200]>
[<source src="https://v.redd.it/et9so1j0z6a21/HLSPlaylist.m3u8" type="application/vnd.apple.mpegURL"/>]

After waiting a second or two between runs, I ran it multiple times with no problems.
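To go from the <source> tags to the actual video URL, the src attribute can be read off each tag. A sketch, using the <source> tag from the output above as a fixed input rather than a live request:

```python
from bs4 import BeautifulSoup

# The <source> tag printed in the output above, used here as a fixed input
html = ('<source src="https://v.redd.it/et9so1j0z6a21/HLSPlaylist.m3u8" '
        'type="application/vnd.apple.mpegURL"/>')

soup = BeautifulSoup(html, 'html.parser')
# Collect the src attribute of every <source> tag found
video_urls = [tag['src'] for tag in soup.find_all('source')]
print(video_urls)
```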

Answer 1 (score: 0)

I tried the code below with a 30-second timeout added, and it worked on every request. (Note that timeout only bounds how long requests waits for a response; it does not add a delay between requests.)

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/', timeout=30)
if page.status_code == 200:
    soup = BeautifulSoup(page.text, 'lxml')
    source_tags = soup.find_all('source')
    print(source_tags)
else:
    print(page.status_code, page)