I want to extract at least 20 user reviews for each movie, but I don't know how to loop through the IMDb title pages and then extract the user reviews with BeautifulSoup.
title_link(1) = "https://www.imdb.com/title/tt7131622/?ref_=adv_li_tt"
user_reviews_link_movie1 = "https://www.imdb.com/title/tt7131622/reviews?ref_=tt_ov_rt"
From the static search page, I was able to extract each movie's title, year, rating, and metascore.
# Import packages and set urls
from requests import get
url = 'https://www.imdb.com/search/title/?title_type=feature,tv_movie&release_date=2018-01-01,2019-12-31&count=250'
response = get(url)
print(response.text[:500])
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
movie_containers = html_soup.find_all('div', class_ = 'lister-item mode-advanced')
print(type(movie_containers))
print(len(movie_containers))
# Lists to store the scraped data in
names = []
years = []
imdb_ratings = []
metascores = []
votes = []
# Extract data from individual movie container
for container in movie_containers:
    # If the movie has Metascore, then extract:
    if container.find('div', class_ = 'ratings-metascore') is not None:
        # The name
        name = container.h3.a.text
        names.append(name)
        # The year
        year = container.h3.find('span', class_ = 'lister-item-year').text
        years.append(year)
        # The IMDB rating
        imdb = float(container.strong.text)
        imdb_ratings.append(imdb)
        # The Metascore
        m_score = container.find('span', class_ = 'metascore').text
        metascores.append(int(m_score))
import pandas as pd
test_df = pd.DataFrame({'movie': names,'year': years,'imdb': imdb_ratings,'metascore': metascores})
test_df
Actual results:

movie                                   year     imdb   metascore
Once Upon a Time... in Hollywood        (2019)   8.1    83
Scary Stories to Tell in the Dark       (2019)   6.5    61
Fast & Furious Presents: Hobbs & Shaw   (2019)   6.8    60
Avengers: Endgame                       (2019)   8.6    78
Expected:

movie1   year1   imdb1   metascore1   review1
movie1   year1   imdb1   metascore1   review2
...
movie1   year1   imdb1   metascore1   review20
movie2   year2   imdb2   metascore2   review1
...
movie2   year2   imdb2   metascore2   review20
...
movie250 year250 imdb250 metascore250 review20
Answer 0 (score: 0):
Assuming the answer to my question in the comments is "yes":

Below is a solution to your initial request. It checks whether a given movie actually has 20 reviews; if there are fewer, it collects all that are available.

Technically, the parsing process is correct; I verified it while limiting the run with movie_containers = movie_containers[:3]. Collecting all the data will take some time.

UPDATE: I just finished collecting the info for all 250 movies; everything was scraped correctly, so the solution itself works as provided.

If you want to parse further, that is, collect data for the next 250 movies and so on, you can add one more loop level to this parser. The process is similar to the one in the "Reviews extracting" part of the code; a pagination sketch is included after the solution code below.
# Import packages and set urls
from requests import get
from bs4 import BeautifulSoup
import pandas as pd

base_url = 'https://www.imdb.com/search/title/?title_type=feature,tv_movie&release_date=2018-01-01,2019-12-31&count=250'
url_header_for_reviews = 'https://www.imdb.com'
url_tail_for_reviews = 'reviews?ref_=tt_urv'
base_response = get(base_url)
html_soup = BeautifulSoup(base_response.text, 'html.parser')
movie_containers = html_soup.find_all('div', class_ = 'lister-item mode-advanced')
result_df = pd.DataFrame()

# Extract data from individual movie container
for container in movie_containers:
    # If the movie has Metascore, then extract:
    if container.find('div', class_ = 'ratings-metascore') is not None:
        # Reviews extracting
        num_reviews = 20
        # Getting the last piece of the link puzzle for a movie's reviews link
        url_middle_for_reviews = container.find('a')['href']
        # Opening the reviews page of a concrete movie
        response_reviews = get(url_header_for_reviews + url_middle_for_reviews + url_tail_for_reviews)
        reviews_soup = BeautifulSoup(response_reviews.text, 'html.parser')
        # Searching all reviews
        reviews_containers = reviews_soup.find_all('div', class_ = 'imdb-user-review')
        # Check if the actual number of reviews is less than the target one
        if len(reviews_containers) < num_reviews:
            num_reviews = len(reviews_containers)
        # Looping through each review and extracting title and body
        reviews_titles = []
        reviews_bodies = []
        for review_index in range(num_reviews):
            review_container = reviews_containers[review_index]
            review_title = review_container.find('a', class_ = 'title').text.strip()
            review_body = review_container.find('div', class_ = 'text').text.strip()
            reviews_titles.append(review_title)
            reviews_bodies.append(review_body)
        # The name, repeated so that each review gets its own row
        name = container.h3.a.text
        names = [name for i in range(num_reviews)]
        # The year
        year = container.h3.find('span', class_ = 'lister-item-year').text
        years = [year for i in range(num_reviews)]
        # The IMDB rating
        imdb_rating = float(container.strong.text)
        imdb_ratings = [imdb_rating for i in range(num_reviews)]
        # The Metascore
        metascore = container.find('span', class_ = 'metascore').text
        metascores = [metascore for i in range(num_reviews)]
        # Gathering up scraped data into result_df
        if num_reviews > 0:
            movie_df = pd.DataFrame({'movie': names, 'year': years, 'imdb': imdb_ratings,
                                     'metascore': metascores, 'review_title': reviews_titles,
                                     'review_body': reviews_bodies})
            # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
            result_df = pd.concat([result_df, movie_df], ignore_index=True)
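The extra loop level mentioned above could look like the minimal sketch below. It assumes the search endpoint accepts a start offset parameter (start=1, 251, 501, ...); the parameter name and page size here are assumptions based on how IMDb's advanced search was paginated at the time, so verify them against the live site before relying on them.

# Pagination sketch: loop over search result pages (assumption: the search
# URL accepts a 'start' offset parameter; verify against the live site)
from requests import get
from bs4 import BeautifulSoup

search_url_template = ('https://www.imdb.com/search/title/'
                       '?title_type=feature,tv_movie'
                       '&release_date=2018-01-01,2019-12-31'
                       '&count=250&start={start}')

for start in range(1, 1001, 250):  # items 1-250, 251-500, 501-750, 751-1000
    page_response = get(search_url_template.format(start=start))
    page_soup = BeautifulSoup(page_response.text, 'html.parser')
    page_containers = page_soup.find_all('div', class_ = 'lister-item mode-advanced')
    # ...then run the same per-movie extraction loop as above on page_containers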
By the way, I'm not sure IMDB will let you collect data for all the movies in a loop as-is. You may get a captcha or be redirected to some other page. If these issues appear, I'd go with a simple solution: pause the scraping and/or change the user-agents.
A pause (sleep) can be implemented as follows:
import time
import numpy as np
time.sleep((30-5)*np.random.random()+5) #from 5 to 30 seconds
A user-agent can be inserted into a request as follows:
import requests
from bs4 import BeautifulSoup
url = ('http://www.link_you_want_to_make_request_on.com/bla_bla')
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
Google some other variants of user-agents, build a list of them, and change them from time to time across your requests, as shown in the sketch below. Be careful with the user-agents you use: some of them indicate mobile or tablet devices, and for those a site (not only IMDB) may serve response pages in a format different from the PC version: different markup, different design, and so on. So, in general, the algorithm above works only for the PC version of the pages.
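A minimal sketch of that rotation idea is below; the user-agent strings are illustrative desktop examples added for the sketch, not a vetted list, so substitute ones you have checked yourself:

import random
import requests

# Illustrative desktop user-agent strings; replace with a list you have verified
user_agents = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0',
]

def get_with_random_ua(url):
    # Pick a different user-agent for each request
    headers = {'User-Agent': random.choice(user_agents)}
    return requests.get(url, headers=headers)

response = get_with_random_ua('https://www.imdb.com/title/tt7131622/reviews?ref_=tt_ov_rt')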