IMDb redirect hyperlinks don't work after scraping with Beautiful Soup

Date: 2021-01-03 12:10:11

Tags: python web-scraping beautifulsoup

I am trying to use Beautiful Soup to scrape the official-site data from a title page on IMDb. For example, to get the data for Interstellar, I have this code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/title/tt0816692/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
title_detail_soup = soup.find('div', {'id': 'titleDetails'})
details_soup = title_detail_soup.find_all('div', class_='txt-block')
detail_list = ['Official Sites:', 'Country:', 'Language:',
                'Release Date:', 'Also Known As:', 'Filming Locations:']
details = {}
for detail in details_soup:
    try:
        # Each txt-block starts with an <h4> heading
        head = detail.find('h4')
        if head.get_text() in detail_list:
            # If the detail heading is in the detail list
            if head.get_text() == 'Official Sites:':
                # If details is about official sites
                official_site = {}
                detail.h4.decompose()    # remove <h4> tags
                a_tags = detail.find_all('a')
                for a_tag in a_tags:
                    # exclude See more>> links
                    if a_tag.get_text() != 'See more':
                        data = url+a_tag['href']    # final link is base URL + hyperlink
                        official_site[a_tag.get_text()] = data
                details['official-sites'] = official_site
    except Exception as e:
        print(e)
print(details)    # Print the detail dictionary

The HTML of the page:

<div class="article" id="titleDetails">
    <span class="rightcornerlink">
        <a href="https://contribute.imdb.com/updates?edit=tt0816692/details&amp;ref_=tt_dt_dt">Edit</a>
    </span>
    <h2>Details</h2>
    <div class="txt-block">
        <h4 class="inline">Official Sites:</h4>
            <a href="/offsite/?page-action=offsite-facebook&amp;token=BCYpckvEa_ZSPp2TC3Ztr1DNqde5ZCUHig7950CLYvsgSHOzBCfJSHpgg71IYRsZYP1DuUpTZb9H%0D%0AhK4BzY5AiKU5Vy2oFn7i91MVFT_TnR39yhU5V5NBAse2mY_ht5WdsmSBxQPGRBC6pIJJym7IXbao%0D%0ATz9SG3r8MjKfwIe9hBrJU5Y-vNdnR_uaDq_24s2NGj5ikJYWl_093YIHy_I2lnK-I6jK9OvOpwgw%0D%0AupABQOymuxA%0D%0A&amp;ref_=tt_pdt_ofs_offsite_0" rel="nofollow">Official Facebook</a>
        <span class="ghost">|</span>
            <a href="/offsite/?page-action=offsite-interstellarmovie&amp;token=BCYuB9Ouy5QXl_3W_k3RrnnXUdrfSLbBFfOcrJTX0yo5TtTDqsSLpry8x7drK8l0xpOJSEqt73Hz%0D%0A08qyki3_i83CrCym7SXSkevFQpT32TjuuJLgIlQ-W5CpRd-wZC9eD4R3SZOMdOfSjeoOtqiE5uU_%0D%0Az-YG1i5AImXY2xLmHSNwABh1hU7VHS-FnqKDW9G-4KOF78zpKdDIfrwlRs8px0yef9u51LojZz05%0D%0A0OBfTmRs_JI%0D%0A&amp;ref_=tt_pdt_ofs_offsite_1" rel="nofollow">Official site</a>
        <span class="ghost">|</span>
        <span class="see-more inline">
            <a href="externalsites?ref_=tt_dt_dt#official">See more</a>&nbsp;»
        </span>
    </div>
</div>

This successfully extracts the data into dictionary form, but when I use the hyperlinks from the dictionary, they do not work and return a "requested URL not found" error.

The output dictionary:

{
    'official-sites': {
        'Official Facebook': 'https://www.imdb.com/title/tt0816692/offsite/?page-action=offsite-facebook&token=BCYqzjQrP9OA_yaYNwA9Q8hI5gt41EmHuu0_ePjZPHKui-hEmAEySo-0SHzZmSjpeeEVy3Art6SH%0D%0ATseW16b3uKMjIH8iOyO-ZVYR025mQ4YCbZIWUKEcEM-z0eOeUvud3KGbuQTCxrNhTGAx7xgFIB89%0D%0Al9jT6pvqSpSCdNYACnBhk_8MuNjCn8GIJZk-6PR1MZ1xQB5yDrqRNhNt9Dg8IDMXVpxTR8-LFu2I%0D%0Amf5KmXbmXos%0D%0A', 
        'Official site': 'https://www.imdb.com/title/tt0816692/offsite/?page-action=offsite-interstellarmovie&token=BCYsMb9WTKJLH9M9nmxvLDpn8ikQDnQmpVQZBurp9Trd1-XXbA_Bh4xoKx6yf3Qx4YNn3fT9UhFe%0D%0AnzcULcEY5SFJ7CW8kBj6dQvZA9GyvqfZMyIDS7daNe6rne6DkdL23CDPAkk1Xwr9rjiE6FF_m0vX%0D%0ASLH2NnzOf8BcKnaWILhGGdvHTYeZ_uRGm4QCIOzxw-CvLM2rag04ZbXM2ZUEvQm6OedW9XumtsnQ%0D%0AoP7ce67sytE%0D%0A'
    }
}
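For reference, the hrefs in the HTML above are root-relative (they start with `/offsite/...`), so plain string concatenation with the full title URL keeps the `/title/tt0816692/` path segment in the result, while Python's `urllib.parse.urljoin` resolves the href against the domain instead. A minimal sketch of the difference (the token value is shortened for illustration):

```python
from urllib.parse import urljoin

base = 'https://www.imdb.com/title/tt0816692/'
href = '/offsite/?page-action=offsite-facebook&token=XYZ'  # token shortened for illustration

# Plain concatenation keeps the /title/tt0816692/ path segment
print(base + href)         # ...tt0816692//offsite/?...
# urljoin resolves a root-relative href against the domain root
print(urljoin(base, href))  # https://www.imdb.com/offsite/?...
```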

1 answer:

Answer 0: (score: 0)

Everywhere the code calls get_text(), make sure the object is not None first.

Try this:

import requests
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/title/tt0816692/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
title_detail_soup = soup.find('div', {'id': 'titleDetails'})
headings_soup = title_detail_soup.find_all(['h2', 'h3'])
details_soup = title_detail_soup.find_all('div', class_='txt-block')
detail_list = ['Official Sites:', 'Country:', 'Language:',
                'Release Date:', 'Also Known As:', 'Filming Locations:']
details = {}
for detail in details_soup:
    try:
        head = detail.find('h4')
        if head.get_text() in detail_list:
            if head.get_text() == 'Official Sites:':
                official_site = {}
                detail.h4.decompose()
                a_tags = detail.find_all('a')
                for a_tag in a_tags:
                    if a_tag.get_text() != 'See more':
                        data = url + a_tag['href']
                        official_site[a_tag.text] = data
                details['official-sites'] = official_site
    except Exception as e:
        # print(e)
        pass
print(details)
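The None check the answer recommends can be sketched on a small standalone snippet (the HTML fragment here is a made-up example): `find()` returns None when the tag is absent, so guard before calling `get_text()`.

```python
from bs4 import BeautifulSoup

html = '<div class="txt-block"><h4 class="inline">Country:</h4> USA</div>'
block = BeautifulSoup(html, 'html.parser').find('div', class_='txt-block')

head = block.find('h4')
# find() returns None when no <h4> exists; guard before calling get_text()
heading = head.get_text() if head is not None else None
print(heading)  # Country:
```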