I want to save some award information from the IMDB website, but I can't get at the JavaScript text I need.
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.imdb.com/event/ev0000003/2000',
    'https://www.imdb.com/event/ev0000003/2001',
]

for url in urls:
    # Download the page and parse it
    response = requests.get(url).content
    soup = BeautifulSoup(response, 'html.parser')
    # Collect every inline JavaScript block on the page
    scripts = soup.find_all('script', {'type': 'text/javascript'})
Now, how can I access only the category information:
"categories":[{"categoryName":"Best Actor in a Leading Role","nominations":[{"primaryNominees":[{"name":"Kevin Spacey","note":null,"imageUrl":.....
Since I have to do this for different awards and years, my idea is to save them in a JSON file:
{"award": "oscars",
"year": "2000",
"data": [{"categoryName":"Best Actor in a Leading Role","nominations":[{"primaryNominees":[{"name":"Kevin Spacey","note":null,"imageUrl":.....
}
Answer 0 (score: 2)
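The data is stored inside the page's JavaScript, so you can get at it with, for example, a regular expression. To parse the extracted data, you can use the json module.
For example: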
import re
import json
import requests

urls = [
    'https://www.imdb.com/event/ev0000003/2000',
    'https://www.imdb.com/event/ev0000003/2001',
]

for url in urls:
    response = requests.get(url).text
    # Capture the JSON object passed to IMDbReactWidgets.NomineesWidget.push(...)
    data = json.loads(re.findall(r'IMDbReactWidgets\.NomineesWidget\.push.*?(\{.*\})', response)[0])
    # print(json.dumps(data, indent=4))  # <-- uncomment this to print all data

    for award in data['nomineesWidgetModel']['eventEditionSummary']['awards']:
        # Skip every award at this event other than the Oscar itself
        if award['awardName'] != 'Oscar':
            continue
        for category in award['categories']:
            print(category['categoryName'])

    print('-' * 80)
Prints:
Best Actor in a Leading Role
Best Actor in a Supporting Role
Best Actress in a Leading Role
Best Actress in a Supporting Role
Best Art Direction-Set Decoration
Best Cinematography
Best Costume Design
Best Director
Best Documentary, Features
Best Documentary, Short Subjects
Best Effects, Sound Effects Editing
Best Effects, Visual Effects
Best Film Editing
Best Foreign Language Film
...and so on.
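
To get from the printed category names to the JSON file layout asked for in the question, the same extraction could be wrapped in a few helpers. This is only a rough sketch: the function names (extract_widget_data, build_record, save_record) and the 'oscars' label are made up for illustration, and it assumes the page still embeds the data in the structure used above.

```python
import json
import re

# Same pattern as in the loop above: capture the JSON object passed to
# IMDbReactWidgets.NomineesWidget.push(...), which sits on a single line.
WIDGET_RE = re.compile(r'IMDbReactWidgets\.NomineesWidget\.push.*?(\{.*\})')


def extract_widget_data(html):
    """Return the parsed widget JSON from the page HTML, or None if not found."""
    match = WIDGET_RE.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


def build_record(widget_data, year, award_name='Oscar', label='oscars'):
    """Build the {"award", "year", "data"} layout from the question.

    award_name is the value matched against 'awardName' in the page data;
    label is a hypothetical name chosen for the output file.
    """
    awards = widget_data['nomineesWidgetModel']['eventEditionSummary']['awards']
    categories = []
    for award in awards:
        if award['awardName'] == award_name:
            categories.extend(award['categories'])
    return {'award': label, 'year': str(year), 'data': categories}


def save_record(record, path):
    """Write one record to a JSON file."""
    with open(path, 'w', encoding='utf-8') as fh:
        json.dump(record, fh, ensure_ascii=False, indent=2)
```

Inside the loop above, this might be used as data = extract_widget_data(response), then save_record(build_record(data, year), 'oscars_%s.json' % year), with year taken from the last path segment of the URL, e.g. url.rsplit('/', 1)[-1]. The file name is just an example.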