如何使代码循环遍历页面上的所有项目?

时间:2019-07-06 05:38:50

标签: python-3.x web-scraping beautifulsoup

我正在尝试抓取IMDB列表,但是目前我在表中打印出的所有内容都是第一部电影(玩具总动员)。

我尝试初始化count = 0,然后尝试在for循环结束时更新first_movie = movie_containers [count + 1],但它不起作用。无论我尝试什么,都会遇到各种错误,例如“数组的长度必须相同”。如我所说,当它起作用时,只有页面上的第一部电影被打印到表中50次。

from bs4 import BeautifulSoup 
from requests import get
import pandas as pd

url = 'https://www.imdb.com/search/title/?genres=comedy&explore=title_type,genres&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=3396781f-d87f-4fac-8694-c56ce6f490fe&pf_rd_r=3PWY0EZBAKM22YP2F114&pf_rd_s=center-1&pf_rd_t=15051&pf_rd_i=genre&ref_=ft_gnr_pr1_i_1'

response = get(url)


html = BeautifulSoup(response.text, 'lxml')

movie_containers = html.find_all('div', class_='lister-item mode-advanced')
first_movie = movie_containers[0]

name = first_movie.h3.a.text

year = first_movie.find('span', class_='lister-item-year text-muted unbold').text

rating = float(first_movie.find('div', class_='inline-block ratings-imdb-rating').text.strip())

metascore = int(first_movie.find('span', class_='metascore favorable').text)

vote = first_movie.find('span', attrs={'name':'nv'})
vote = vote['data-value']

gross = first_movie.find('span', attrs={'data-value':'272,257,544'})
gross = '$' + gross['data-value']

info_container = first_movie.findAll('p', class_='text-muted')[0]

certificate = info_container.find('span', class_='certificate').text

runtime = info_container.find('span', class_='runtime').text

genre = info_container.find('span', class_='genre').text.strip()

description = first_movie.findAll('p', class_='text-muted')[1].text.strip()

#second_movie_metascore = movie_containers[1].find('div', class_='ratings-metascore')

names = []
years = []
ratings = []
metascores = []
votes = []
grossing = []
certificates = []
runtimes = []
genres = []
descriptions = []

for container in movie_containers:
    try:
        name = first_movie.h3.a.text
        names.append(name)
    except:
        continue

    try:
        year = first_movie.find('span', class_='lister-item-year text-muted unbold').text
        years.append(year)
    except:
        continue

    try:
        rating = float(first_movie.find('div', class_='inline-block ratings-imdb-rating').text.strip())
        ratings.append(rating)
    except:
        continue

    try:
        metascore = int(first_movie.find('span', class_='metascore favorable').text)
        metascores.append(metascore)
    except:
        continue

    try:
        vote = first_movie.find('span', attrs={'name':'nv'})
        vote = vote['data-value']
        votes.append(vote)
    except:
        continue

    try:
        gross = first_movie.find('span', attrs={'data-value':'272,257,544'})
        gross = '$' + gross['data-value']
        grossing.append(gross)
    except:
        continue

    try:
        certificate = info_container.find('span', class_='certificate').text
        certificates.append(certificate)
    except:
        continue

    try:
        runtime = info_container.find('span', class_='runtime').text
        runtimes.append(runtime)
    except:
        continue

    try:
        genre = info_container.find('span', class_='genre').text.strip()
        genres.append(genre)
    except:
        continue

    try:
        description = first_movie.findAll('p', class_='text-muted')[1].text.strip()
        descriptions.append(description)
    except:
        continue


test_df = pd.DataFrame({'Movie': names,
'Year': years,
'IMDB': ratings,
'Metascore': metascores,
'Votes': votes,
'Gross': grossing,
'Certificate': certificates,
'Runtime': runtimes,
'Genres': genres,
'Descriptions': descriptions

})
#print(test_df.info())
print(test_df)

此外,当打印出表格时,如何从1开始而不是从0开始pd列表?

1 个答案:

答案 0 :(得分:1)

您可以尝试使用此代码来抓取数据。我现在将其打印在屏幕上,但是您会将数据放入Panda的数据框中:

from bs4 import BeautifulSoup
import requests
import textwrap

url = 'https://www.imdb.com/search/title/?genres=comedy&explore=title_type,genres&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=3396781f-d87f-4fac-8694-c56ce6f490fe&pf_rd_r=3PWY0EZBAKM22YP2F114&pf_rd_s=center-1&pf_rd_t=15051&pf_rd_i=genre&ref_=ft_gnr_pr1_i_1'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

names = []
years = []
ratings = []
metascores = []
votes = []
grossing = []
certificates = []
runtimes = []
genres = []
descriptions = []

for i in soup.select('.lister-item-content'):
    for t in i.select('h3 a'):
        names.append(t.text)
        break
    else:
        names.append('-')

    for t in i.select('.lister-item-year'):
        years.append(t.text)
        break
    else:
        years.append('-')

    for t in i.select('.ratings-imdb-rating'):
        ratings.append(t.text.strip())
        break
    else:
        ratings.append('-')

    for t in i.select('.metascore'):
        metascores.append(t.text.strip())
        break
    else:
        metascores.append('-')

    for t in i.select('.sort-num_votes-visible span:contains("Votes:") + span[data-value]'):
        votes.append(t['data-value'])
        break
    else:
        votes.append('-')

    for t in i.select('.sort-num_votes-visible span:contains("Gross:") + span[data-value]'):
        grossing.append(t['data-value'])
        break
    else:
        grossing.append('-')

    for t in i.select('.certificate'):
        certificates.append(t.text.strip())
        break
    else:
        certificates.append('-')

    for t in i.select('.runtime'):
        runtimes.append(t.text.strip())
        break
    else:
        runtimes.append('-')

    for t in i.select('.genre'):
        genres.append(t.text.strip().split(','))
        break
    else:
        genres.append('-')

    for t in i.select('p.text-muted')[1:2]:
        descriptions.append(t.text.strip())
        break
    else:
        descriptions.append('-')

for row in zip(names, years, ratings, metascores, votes, grossing, certificates, runtimes, genres, descriptions):
    for col_num, data in enumerate(row):
        if col_num == 0:
            t = textwrap.shorten(str(data), 35)
            print('{: ^35}'.format(t), end='|')
        elif col_num in (1, 2, 3, 4, 5, 6, 7):
            t = textwrap.shorten(str(data), 12)
            print('{: ^12}'.format(t), end='|')
        else:
            t = textwrap.shorten(str(data), 35)
            print('{: ^35}'.format(t), end='|')

    print()

打印:

            Toy Story 4            |   (2019)   |    8.3     |     84     |   50496    |272,257,544 |     G      |  100 min   |['Animation', ' Adventure', ' [...]|When a new toy called "Forky" [...]|
         Charlie's Angels          |   (2019)   |     -      |     -      |     -      |     -      |     -      |     -      |['Action', ' Adventure', ' Comedy']|  Reboot of the 2000 action [...]  |
          Murder Mystery           |   (2019)   |    6.0     |     38     |   46255    |     -      |   PG-13    |   97 min   |  ['Action', ' Comedy', ' Crime']  | A New York cop and his wife [...] |
             Eile veel             |(III) (2019)|    7.1     |     56     |   10539    | 26,132,740 |   PG-13    |  116 min   | ['Comedy', ' Fantasy', ' Music']  |    A struggling musician [...]    |
    Mehed mustas: globaalne oht    |   (2019)   |    5.7     |     38     |   24338    | 66,894,949 |   PG-13    |  114 min   |['Action', ' Adventure', ' Comedy']|The Men in Black have always [...] |
            Good Omens             |   (2019)   |    8.3     |     -      |   24804    |     -      |     -      |   60 min   |      ['Comedy', ' Fantasy']       |  A tale of the bungling of [...]  |
        Ükskord Hollywoodis        |   (2019)   |    9.6     |     88     |    6936    |     -      |     -      |  159 min   |       ['Comedy', ' Drama']        |A faded television actor and [...] |
              Aladdin              |   (2019)   |    7.4     |     53     |   77230    |313,189,616 |     PG     |  128 min   |['Adventure', ' Comedy', ' Family']|A kind-hearted street urchin [...] |
           Mr. Iglesias            |  (2019– )  |    7.2     |     -      |    2266    |     -      |     -      |   30 min   |            ['Comedy']             | A good-natured high school [...]  |
              Shazam!              |   (2019)   |    7.3     |     70     |   129241   |140,105,000 |   PG-13    |  132 min   |['Action', ' Adventure', ' Comedy']|   We all have a superhero [...]   |
               Shaft               |   (2019)   |    6.4     |     40     |   12016    | 19,019,975 |     R      |  111 min   |  ['Action', ' Comedy', ' Crime']  |   John Shaft Jr., a cyber [...]   |
              Kontor               |(2005–2013) |    8.8     |     -      |   301620   |     -      |     -      |   22 min   |            ['Comedy']             |A mockumentary on a group of [...] |
              Sõbrad               |(1994–2004) |    8.9     |     -      |   683205   |     -      |     -      |   22 min   |      ['Comedy', ' Romance']       |  Follows the personal and [...]   |
             Lelulugu              |   (1995)   |    8.3     |     95     |   800957   |191,796,233 |     -      |   81 min   |['Animation', ' Adventure', ' [...]| A cowboy doll is profoundly [...] |
            Lelulugu 3             |   (2010)   |    8.3     |     92     |   689098   |415,004,880 |     -      |  103 min   |['Animation', ' Adventure', ' [...]|   The toys are mistakenly [...]   |
      Orange Is the New Black      |  (2013– )  |    8.1     |     -      |   256417   |     -      |     -      |   59 min   |  ['Comedy', ' Crime', ' Drama']   |  Convicted of a decade old [...]  |
        Brooklyn Nine-Nine         |  (2013– )  |    8.4     |     -      |   154342   |     -      |     -      |   22 min   |       ['Comedy', ' Crime']        | Jake Peralta, an immature, [...]  |
        Always Be My Maybe         |   (2019)   |    6.9     |     64     |   26210    |     -      |   PG-13    |  101 min   |      ['Comedy', ' Romance']       | A pair of childhood friends [...] |
        The Dead Don't Die         |   (2019)   |    6.0     |     54     |    6841    | 6,116,830  |     R      |  104 min   | ['Comedy', ' Fantasy', ' Horror'] |    The peaceful town of [...]     |
        Suure Paugu teooria        |(2007–2019) |    8.2     |     -      |   653122   |     -      |     -      |   22 min   |      ['Comedy', ' Romance']       |  A woman who moves into an [...]  |
            Lelulugu 2             |   (1999)   |    7.9     |     88     |   476104   |245,852,179 |     -      |   92 min   |['Animation', ' Adventure', ' [...]|When Woody is stolen by a toy [...]|
  Fast & Furious Presents: [...]   |   (2019)   |     -      |     -      |     -      |     -      |   PG-13    |     -      |['Action', ' Adventure', ' Comedy']|Lawman Luke Hobbs and outcast [...]|
            Dead to Me             |  (2019– )  |    8.2     |     -      |   23149    |     -      |     -      |   30 min   |       ['Comedy', ' Drama']        |  A series about a powerful [...]  |
         Pintsaklipslased          |  (2011– )  |    8.5     |     -      |   328568   |     -      |     -      |   44 min   |       ['Comedy', ' Drama']        | On the run from a drug deal [...] |
     The Secret Life of Pets 2     |   (2019)   |    6.6     |     55     |    8613    |135,983,335 |     PG     |   86 min   |['Animation', ' Adventure', ' [...]| Continuing the story of Max [...] |
            Good Girls             |  (2018– )  |    7.9     |     -      |   18518    |     -      |     -      |   43 min   |  ['Comedy', ' Crime', ' Drama']   |   Three suburban mothers [...]    |
     Ralph Breaks the Internet     |   (2018)   |    7.1     |     71     |   91165    |201,091,711 |     PG     |  112 min   |['Animation', ' Adventure', ' [...]|Six years after the events of [...]|
             Trolls 2              |   (2020)   |     -      |     -      |     -      |     -      |     -      |     -      |['Animation', ' Adventure', ' [...]| Sequel to the 2016 animated hit.  |
             Booksmart             |   (2019)   |    7.4     |     84     |   24935    | 21,474,121 |     R      |  102 min   |            ['Comedy']             |  On the eve of their high [...]   |
       The Old Man & the Gun       |   (2018)   |    6.8     |     80     |   27337    | 11,277,120 |   PG-13    |   93 min   |['Biography', ' Comedy', ' Crime'] | Based on the true story of [...]  |
              Fleabag              |  (2016– )  |    8.6     |     -      |   25041    |     -      |     -      |   27 min   |       ['Comedy', ' Drama']        |A comedy series adapted from [...] |
          Schitt's Creek           |  (2015– )  |    8.2     |     -      |   18112    |     -      |     -      |   22 min   |            ['Comedy']             |When rich video-store magnate [...]|
             Catch-22              |  (2019– )  |    7.9     |     -      |    6829    |     -      |     -      |   45 min   |  ['Comedy', ' Crime', ' Drama']   |Limited series adaptation of [...] |
              Häbitu               |  (2011– )  |    8.7     |     -      |   171782   |     -      |     -      |   46 min   |       ['Comedy', ' Drama']        |  A scrappy, fiercely loyal [...]  |
          Jane the Virgin          |  (2014– )  |    7.8     |     -      |   30106    |     -      |     -      |   60 min   |            ['Comedy']             |  A young, devout Catholic [...]   |
       Parks and Recreation        |(2009–2015) |    8.6     |     -      |   178220   |     -      |     -      |   22 min   |            ['Comedy']             |   The absurd antics of an [...]   |
     One Punch Man: Wanpanman      |  (2015– )  |    8.9     |     -      |   87166    |     -      |     -      |   24 min   | ['Animation', ' Action', ' [...]  |The story of Saitama, a hero [...] |
             The Boys              |  (2019– )  |     -      |     -      |     -      |     -      |     -      |   60 min   |  ['Action', ' Comedy', ' Crime']  |A group of vigilantes set out [...]|
     Pokémon Detective Pikachu     |   (2019)   |    6.8     |     53     |   65217    |142,692,000 |     PG     |  104 min   |['Action', ' Adventure', ' Comedy']|   In a world where people [...]   |
    Kuidas ma kohtasin teie ema    |(2005–2014) |    8.3     |     -      |   544472   |     -      |     -      |   22 min   |      ['Comedy', ' Romance']       |  A father recounts to his [...]   |
 It's Always Sunny in Philadelphia |  (2005– )  |    8.7     |     -      |   171517   |     -      |     -      |   22 min   |            ['Comedy']             | Five friends with big egos [...]  |
              Stuber               |   (2019)   |    5.9     |     53     |    794     |     -      |     R      |   93 min   |       ['Action', ' Comedy']       |A detective recruits his Uber [...]|
          Moodne perekond          |  (2009– )  |    8.4     |     -      |   314178   |     -      |     -      |   22 min   |      ['Comedy', ' Romance']       | Three different but related [...] |
       The Umbrella Academy        |  (2019– )  |    8.1     |     -      |   73654    |     -      |     -      |   60 min   |['Action', ' Adventure', ' Comedy']|    A disbanded group of [...]     |
              Happy!               |(2017–2019) |    8.3     |     -      |   25284    |     -      |     -      |   60 min   |  ['Action', ' Comedy', ' Crime']  | An injured hitman befriends [...] |
          Rick and Morty           |  (2013– )  |    9.3     |     -      |   279411   |     -      |     -      |   23 min   |['Animation', ' Adventure', ' [...]|   An animated series that [...]   |
             Cobra Kai             |  (2018– )  |    8.8     |     -      |   34069    |     -      |     -      |   30 min   |  ['Action', ' Comedy', ' Drama']  |Decades after their 1984 All [...] |
          Roheline raamat          |   (2018)   |    8.2     |     69     |   215069   | 85,080,171 |   PG-13    |  130 min   |['Biography', ' Comedy', ' Drama'] |  A working-class Italian- [...]   |
              Kondid               |(2005–2017) |    7.9     |     -      |   130232   |     -      |     -      |   40 min   |  ['Comedy', ' Crime', ' Drama']   | Forensic anthropologist Dr. [...] |
           Sex Education           |  (2019– )  |    8.4     |     -      |   68509    |     -      |     -      |   45 min   |       ['Comedy', ' Drama']        |  A teenage boy with a sex [...]   |