Accessing JavaScript text with Beautiful Soup

Time: 2019-12-12 19:17:51

Tags: python json web-scraping beautifulsoup

I want to save some award information from the IMDb website, but I can't get at the JavaScript text that contains it.

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.imdb.com/event/ev0000003/2000',
    'https://www.imdb.com/event/ev0000003/2001',
]

for url in urls:
    # Fetch the current URL from the list
    response = requests.get(url).content
    soup = BeautifulSoup(response, 'html.parser')
    # Collect every inline JavaScript block on the page
    scripts = soup.find_all('script', {'type': 'text/javascript'})


Now, how can I access only the category information:

"categories":[{"categoryName":"Best Actor in a Leading Role","nominations":[{"primaryNominees":[{"name":"Kevin Spacey","note":null,"imageUrl":.....  

Since I have to do this for different awards and years, my idea is to save them in JSON files:

{"award": "oscars",  
 "year": "2000",  
 "data": [{"categoryName":"Best Actor in a Leading Role","nominations":[{"primaryNominees":[{"name":"Kevin Spacey","note":null,"imageUrl":.....  
}

1 answer:

Answer 0 (score: 2)

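The data is stored in the page's JavaScript, so Beautiful Soup alone won't give it to you as structured data. You can extract it with, for example, a regular expression, and then parse the extracted text with the json module.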

For example:

import re
import json
import requests

urls = [
    'https://www.imdb.com/event/ev0000003/2000',
    'https://www.imdb.com/event/ev0000003/2001',
]

for url in urls:
    response = requests.get(url).text

    data = json.loads( re.findall(r'IMDbReactWidgets\.NomineesWidget\.push.*?(\{.*\})', response)[0] )

    # print(json.dumps(data, indent=4)) # <-- uncomment this to print all data

    for award in data['nomineesWidgetModel']['eventEditionSummary']['awards']:
        if award['awardName'] != 'Oscar':
            continue
        for category in award['categories']:
            print(category['categoryName'])

    print('-' * 80)

Prints:

Best Actor in a Leading Role
Best Actor in a Supporting Role
Best Actress in a Leading Role
Best Actress in a Supporting Role
Best Art Direction-Set Decoration
Best Cinematography
Best Costume Design
Best Director
Best Documentary, Features
Best Documentary, Short Subjects
Best Effects, Sound Effects Editing
Best Effects, Visual Effects
Best Film Editing
Best Foreign Language Film

...and so on.
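
To get from the printed category names to the file layout asked for in the question, the matching categories can be wrapped in a small dictionary and written out with json.dump. The following is only a minimal sketch: SAMPLE_HTML is a made-up, heavily shortened stand-in for a real page, the helper names (extract_widget_data, build_record, save_record) are hypothetical, and it assumes the page still embeds its data in the same IMDbReactWidgets.NomineesWidget.push call, on a single line, as in the code above.

```python
import json
import re

# Hypothetical, heavily shortened stand-in for a downloaded event page.
SAMPLE_HTML = (
    '<script type="text/javascript">'
    "IMDbReactWidgets.NomineesWidget.push(['center-3-react',"
    '{"nomineesWidgetModel":{"eventEditionSummary":{"awards":['
    '{"awardName":"Oscar","categories":['
    '{"categoryName":"Best Actor in a Leading Role","nominations":[]}]}]}}}]);'
    "</script>"
)

# Same pattern as in the answer: capture the JSON object passed to push().
WIDGET_RE = re.compile(r'IMDbReactWidgets\.NomineesWidget\.push.*?(\{.*\})')


def extract_widget_data(html):
    """Return the parsed JSON object found in the NomineesWidget.push call."""
    match = WIDGET_RE.search(html)
    if match is None:
        raise ValueError('NomineesWidget data not found in page')
    return json.loads(match.group(1))


def build_record(data, award_name, award_label, year):
    """Collect the categories of one award into the layout from the question."""
    awards = data['nomineesWidgetModel']['eventEditionSummary']['awards']
    categories = [
        category
        for award in awards
        if award['awardName'] == award_name
        for category in award['categories']
    ]
    return {'award': award_label, 'year': year, 'data': categories}


def save_record(record, path):
    """Write one record to a JSON file."""
    with open(path, 'w', encoding='utf-8') as fh:
        json.dump(record, fh, indent=4)


record = build_record(extract_widget_data(SAMPLE_HTML), 'Oscar', 'oscars', '2000')
print(record['data'][0]['categoryName'])  # prints: Best Actor in a Leading Role
```

With real pages, replace SAMPLE_HTML with requests.get(url).text inside the loop over urls, take the year from the URL, and call something like save_record(record, 'oscars_2000.json') (the file name is just an example) to get one file per award and year.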