How to collect reviews from external links with bs4?

Asked: 2019-08-24 11:01:36

Tags: python web-scraping beautifulsoup python-requests imdb

I want to extract at least 20 user reviews for each movie, but I don't know how to loop through the IMDb title pages and then extract the user reviews with BeautifulSoup.

start link = "https://www.imdb.com/search/title/?title_type=feature,tv_movie&release_date=2018-01-01,2019-12-31&count=250"

title_link(1) = "https://www.imdb.com/title/tt7131622/?ref_=adv_li_tt"

user_reviews_link_movie1 = "https://www.imdb.com/title/tt7131622/reviews?ref_=tt_ov_rt"
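In other words, the reviews page for each movie appears to be its title link with "reviews" appended, roughly like this (just a sketch of the URL structure; the ref_ query string is simply what I see in the browser):

title_link = 'https://www.imdb.com/title/tt7131622/'
user_reviews_link = title_link + 'reviews?ref_=tt_ov_rt'
# -> 'https://www.imdb.com/title/tt7131622/reviews?ref_=tt_ov_rt'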

From the static search page I was able to extract the title, year, rating, and Metascore of each movie in the list.

# Import packages and set urls

from requests import get
url = 'https://www.imdb.com/search/title/?title_type=feature,tv_movie&release_date=2018-01-01,2019-12-31&count=250'
response = get(url)
print(response.text[:500])

from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)


movie_containers = html_soup.find_all('div', class_ = 'lister-item mode-advanced')
print(type(movie_containers))
print(len(movie_containers))

# Lists to store the scraped data in

names = []
years = []
imdb_ratings = []
metascores = []
votes = []

# Extract data from individual movie container
for container in movie_containers:
# If the movie has Metascore, then extract:
    if container.find('div', class_ = 'ratings-metascore') is not None:

# The name
        name = container.h3.a.text
        names.append(name)
# The year
        year = container.h3.find('span', class_ = 'lister-item-year').text
        years.append(year)
# The IMDB rating
        imdb = float(container.strong.text)
        imdb_ratings.append(imdb)
# The Metascore
        m_score = container.find('span', class_ = 'metascore').text
        metascores.append(int(m_score))

import pandas as pd
test_df = pd.DataFrame({'movie': names,'year': years,'imdb': imdb_ratings,'metascore': metascores})
test_df
  1. Actual result:

    movie  year  imdb  metascore

    Once Upon a Time... in Hollywood (2019) (8.1) (83)

    Scary Stories to Tell in the Dark (2019) (6.5) (61)

    Fast & Furious Presents: Hobbs & Shaw (2019) (6.8) (60)

    Avengers: Endgame (2019) (8.6) (78)

  2. Expected:

    movie1 year1 imdb1 metascore1 review1

    movie1 year1 imdb1 metascore1 review2

    ...

    movie1 year1 imdb1 metascore1 review20

    movie2 year2 imdb2 metascore2 review1

    ...

    movie2 year2 imdb2 metascore2 review20

    ...

    movie250 year250 imdb250 metascore250 review20

1 Answer:

Answer 0 (score: 0)

Assuming the answer to the question I asked in the comments is "yes":

Below is a solution for your initial request. It also checks whether a given movie really has 20 reviews; if there are fewer, it collects all that are available.

Technically the parsing process is correct; I verified it while limiting the run with movie_containers = movie_containers[:3]. Collecting all the data will take some time.

UPDATE: I have just finished collecting the information for all 250 movies - everything is scraped correctly, so the solution below is given as is, just for reference.

If you want to parse further, that is, collect the data for the next 250 movies and so on, you can add one more loop level to this parser. The process is analogous to the one in the "Reviews extracting" part.
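For example, such an outer loop could look roughly like this (just a sketch - it assumes the search URL accepts a start parameter for pagination, which you should verify in the browser first; the per-movie extraction is the code given below):

from requests import get
from bs4 import BeautifulSoup

# Template for the search-result pages; the count and start values are assumptions
# about the IMDb advanced-search URL
search_template = ('https://www.imdb.com/search/title/'
                   '?title_type=feature,tv_movie'
                   '&release_date=2018-01-01,2019-12-31'
                   '&count=250&start={start}')

for start in range(1, 1001, 250):   # 1, 251, 501, 751
    page_response = get(search_template.format(start=start))
    page_soup = BeautifulSoup(page_response.text, 'html.parser')
    movie_containers = page_soup.find_all('div', class_ = 'lister-item mode-advanced')
    if not movie_containers:        # stop when a page comes back empty
        break
    # ... run the per-movie extraction below on movie_containers ...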

# Import packages and set urls

from requests import get
from bs4 import BeautifulSoup
import pandas as pd

base_url = 'https://www.imdb.com/search/title/?title_type=feature,tv_movie&release_date=2018-01-01,2019-12-31&count=250'
url_header_for_reviews = 'https://www.imdb.com'
url_tail_for_reviews = 'reviews?ref_=tt_urv'
base_response = get(base_url)
html_soup = BeautifulSoup(base_response.text, 'html.parser')

movie_containers = html_soup.find_all('div', class_ = 'lister-item mode-advanced')

result_df = pd.DataFrame()

# Extract data from individual movie container
for container in movie_containers:
# If the movie has Metascore, then extract:
    if container.find('div', class_ = 'ratings-metascore') is not None:

# Reviews extracting
        num_reviews = 20
        # Getting last piece of link puzzle for a movie reviews` link
        url_middle_for_reviews = container.find('a')['href']
        # Opening reviews page of a concrete movie
        response_reviews = get(url_header_for_reviews + url_middle_for_reviews + url_tail_for_reviews)
        reviews_soup = BeautifulSoup(response_reviews.text, 'html.parser')
        # Searching all reviews
        reviews_containers = reviews_soup.find_all('div', class_ = 'imdb-user-review')
        # Check if actual number of reviews is less than target one
        if len(reviews_containers) < num_reviews:
            num_reviews = len(reviews_containers)
        # Looping through each review and extracting title and body
        reviews_titles = []
        reviews_bodies = []
        for review_index in range(num_reviews):
            review_container = reviews_containers[review_index]
            review_title = review_container.find('a', class_ = 'title').text.strip()
            review_body = review_container.find('div', class_ = 'text').text.strip()
            reviews_titles.append(review_title)
            reviews_bodies.append(review_body)
# The name
        name = container.h3.a.text
        names = [name for i in range(num_reviews)]
# The year
        year = container.h3.find('span', class_ = 'lister-item-year').text
        years = [year for i in range(num_reviews)]
# The IMDB rating
        imdb_rating = float(container.strong.text)
        imdb_ratings = [imdb_rating for i in range(num_reviews)]
# The Metascore
        metascore = container.find('span', class_ = 'metascore').text
        metascores = [metascore for i in range(num_reviews)]

# Gathering up scraped data into result_df
        if result_df.empty:
            result_df = pd.DataFrame({'movie': names,'year': years,'imdb': imdb_ratings,'metascore': metascores,'review_title': reviews_titles,'review_body': reviews_bodies})
        elif num_reviews > 0:
            result_df = result_df.append(pd.DataFrame({'movie': names,'year': years,'imdb': imdb_ratings,'metascore': metascores,'review_title': reviews_titles,'review_body': reviews_bodies}))
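Once the loop finishes, you may also want to reset the index (append keeps each chunk's own 0..n-1 index) and persist the result, for example:

# Flatten the repeated index left by append and save the data to disk
result_df = result_df.reset_index(drop=True)
result_df.to_csv('imdb_reviews.csv', index=False)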

By the way, I am not sure IMDB will let you collect the data for all the movies in a loop as is. You may get a captcha or be redirected to some other page. If these issues appear, I would go with a simple solution - pauses in scraping and/or changing user-agents.

A pause (sleep) can be implemented as follows:

import time
import numpy as np

time.sleep((30-5)*np.random.random()+5) #from 5 to 30 seconds

A user-agent can be inserted into a request like this:

import requests 
from bs4 import BeautifulSoup

url = ('http://www.link_you_want_to_make_request_on.com/bla_bla')
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

Google some other variants of user-agents, make a list of them, and change them from time to time in further requests. Be careful with the user-agents you use, though: some of them indicate mobile or tablet devices, and for those a site (not only IMDB) may serve response pages in a format different from the PC one - different markup, different design, etc. So the algorithm above generally works only for the PC version of the pages.
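A minimal sketch of such a rotation, combined with the random pause from above (the user-agent strings are just placeholders - put your own verified desktop strings into the list):

import random
import time
import requests

# Placeholder list of desktop user-agent strings; extend or replace with ones you have checked
user_agents = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
]

def polite_get(url):
    """GET a page with a randomly chosen user-agent and a 5-30 second pause."""
    time.sleep((30 - 5) * random.random() + 5)
    headers = {'User-Agent': random.choice(user_agents)}
    return requests.get(url, headers=headers)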