Why does requests.get sometimes fail to retrieve the entire HTML page?

Posted: 2019-07-18 19:14:11

Tags: python beautifulsoup web-crawler response

While crawling this page, I want to extract each movie's certificate (PG, PG-13, etc.). Everything seems to work except for the movie named "Reis".


That movie does have a certificate (12) on the site, but it appears requests.get did not download the HTML for that section: BeautifulSoup finds nothing, and I also checked response.text directly. I have run into similar problems with urllib.request in some cases as well. In both cases the response is successful (status 200). What is the best way to handle this?
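One way to tell whether the markup was never downloaded, as opposed to BeautifulSoup missing it, is to search the raw body directly. A sketch, with a canned string standing in for response.text from the actual request:

```python
# Simulated body standing in for response.text from requests.get.
body = '<div class="lister-item">...<span class="certificate">12</span>...</div>'

# If this substring is absent from the raw body, the server never sent
# that section, so no parser setting will recover it.
marker = 'class="certificate"'
was_downloaded = marker in body
print(was_downloaded)  # True for this simulated body
```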

Here is my code:

from requests import get
from bs4 import BeautifulSoup
import numpy as np  # needed below for np.NaN



def movie_catalog_pages(base_url):
    response = None
    try:
        response = get(base_url)
    except Exception:
        print("Not loaded " + base_url)

    return response

url = 'https://www.imdb.com/search/title/?release_date=2017-01-01,2017-12-31&sort=num_votes,desc&start=101'
response = movie_catalog_pages(url)
html_soup = BeautifulSoup(response.text, 'html.parser')


movies = html_soup.find_all('div', class_='lister-item mode-advanced')

for movie in movies:

    # Movie number
    try:
        temp = movie.h3.span.text
    except AttributeError:
        temp = None

    if temp is None:
        i = np.NaN
    else:
        i = int(temp.replace('.', '').replace(',', ''))

    # movie certificate
    try:
        temp = movie.p.find('span', class_="certificate").text
    except AttributeError:
        temp = None
        print('Error================================', i)

    if temp is None:
        pass
    else:
        print(i, temp)
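A try/except around chained attribute access can mask which step failed. Since the code already uses BeautifulSoup, an alternative pattern (a sketch, using a made-up HTML snippet standing in for one IMDb list item) is to call find() and check for None explicitly:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one lister item from the search page.
item_html = """
<div class="lister-item mode-advanced">
  <h3><span>101.</span></h3>
  <p><span class="certificate">PG-13</span></p>
</div>
"""

soup = BeautifulSoup(item_html, 'html.parser')
movie = soup.find('div', class_='lister-item mode-advanced')

# find() returns None instead of raising, so missing markup is explicit.
cert_tag = movie.find('span', class_='certificate')
certificate = cert_tag.text if cert_tag is not None else None
print(certificate)  # PG-13 here; None whenever the span is absent
```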

1 Answer:

Answer 0: (score: 0)

Thanks to the comments, it turned out my problem was caused by my own IP address and the machine I was crawling from.
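Since the answer points at the crawling machine's IP, one plausible explanation (an assumption, not confirmed in the answer) is that IMDb serves region-dependent page variants, so some certificates simply are not in the localized HTML. A hedged workaround is to send explicit request headers; the header values below are illustrative:

```python
from requests import get

# Illustrative headers: Accept-Language pins the localization so the
# returned HTML does not depend on the crawling machine's region.
HEADERS = {
    'Accept-Language': 'en-US,en;q=0.9',
    'User-Agent': 'Mozilla/5.0 (compatible; movie-catalog-crawler)',
}

def movie_catalog_pages(base_url):
    # Same helper as in the question, but with explicit headers and a timeout.
    try:
        return get(base_url, headers=HEADERS, timeout=10)
    except Exception as exc:
        print('Not loaded', base_url, exc)
        return None
```

Calling movie_catalog_pages(url) then works as before, but the response body should no longer vary with the client's IP-derived locale.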