我正在尝试抓取IMDB列表,但是目前我在表中打印出的所有内容都是第一部电影(玩具总动员)。
我尝试初始化count = 0,然后尝试在for循环结束时更新first_movie = movie_containers [count + 1],但它不起作用。无论我尝试什么,都会遇到各种错误,例如“数组的长度必须相同”。如我所说,当它起作用时,只有页面上的第一部电影被打印到表中50次。
from bs4 import BeautifulSoup
from requests import get
import pandas as pd
url = 'https://www.imdb.com/search/title/?genres=comedy&explore=title_type,genres&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=3396781f-d87f-4fac-8694-c56ce6f490fe&pf_rd_r=3PWY0EZBAKM22YP2F114&pf_rd_s=center-1&pf_rd_t=15051&pf_rd_i=genre&ref_=ft_gnr_pr1_i_1'
response = get(url)
html = BeautifulSoup(response.text, 'lxml')
movie_containers = html.find_all('div', class_='lister-item mode-advanced')
first_movie = movie_containers[0]
name = first_movie.h3.a.text
year = first_movie.find('span', class_='lister-item-year text-muted unbold').text
rating = float(first_movie.find('div', class_='inline-block ratings-imdb-rating').text.strip())
metascore = int(first_movie.find('span', class_='metascore favorable').text)
vote = first_movie.find('span', attrs={'name':'nv'})
vote = vote['data-value']
gross = first_movie.find('span', attrs={'data-value':'272,257,544'})
gross = '$' + gross['data-value']
info_container = first_movie.findAll('p', class_='text-muted')[0]
certificate = info_container.find('span', class_='certificate').text
runtime = info_container.find('span', class_='runtime').text
genre = info_container.find('span', class_='genre').text.strip()
description = first_movie.findAll('p', class_='text-muted')[1].text.strip()
#second_movie_metascore = movie_containers[1].find('div', class_='ratings-metascore')
names = []
years = []
ratings = []
metascores = []
votes = []
grossing = []
certificates = []
runtimes = []
genres = []
descriptions = []
for container in movie_containers:
try:
name = first_movie.h3.a.text
names.append(name)
except:
continue
try:
year = first_movie.find('span', class_='lister-item-year text-muted unbold').text
years.append(year)
except:
continue
try:
rating = float(first_movie.find('div', class_='inline-block ratings-imdb-rating').text.strip())
ratings.append(rating)
except:
continue
try:
metascore = int(first_movie.find('span', class_='metascore favorable').text)
metascores.append(metascore)
except:
continue
try:
vote = first_movie.find('span', attrs={'name':'nv'})
vote = vote['data-value']
votes.append(vote)
except:
continue
try:
gross = first_movie.find('span', attrs={'data-value':'272,257,544'})
gross = '$' + gross['data-value']
grossing.append(gross)
except:
continue
try:
certificate = info_container.find('span', class_='certificate').text
certificates.append(certificate)
except:
continue
try:
runtime = info_container.find('span', class_='runtime').text
runtimes.append(runtime)
except:
continue
try:
genre = info_container.find('span', class_='genre').text.strip()
genres.append(genre)
except:
continue
try:
description = first_movie.findAll('p', class_='text-muted')[1].text.strip()
descriptions.append(description)
except:
continue
test_df = pd.DataFrame({'Movie': names,
'Year': years,
'IMDB': ratings,
'Metascore': metascores,
'Votes': votes,
'Gross': grossing,
'Certificate': certificates,
'Runtime': runtimes,
'Genres': genres,
'Descriptions': descriptions
})
#print(test_df.info())
print(test_df)
此外,当打印出表格时,如何从1开始而不是从0开始pd列表?
答案 0 :(得分:1)
您可以尝试使用此代码来抓取数据。我现在将其打印在屏幕上,但是您会将数据放入Panda的数据框中:
from bs4 import BeautifulSoup
import requests
import textwrap
url = 'https://www.imdb.com/search/title/?genres=comedy&explore=title_type,genres&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=3396781f-d87f-4fac-8694-c56ce6f490fe&pf_rd_r=3PWY0EZBAKM22YP2F114&pf_rd_s=center-1&pf_rd_t=15051&pf_rd_i=genre&ref_=ft_gnr_pr1_i_1'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
names = []
years = []
ratings = []
metascores = []
votes = []
grossing = []
certificates = []
runtimes = []
genres = []
descriptions = []
for i in soup.select('.lister-item-content'):
for t in i.select('h3 a'):
names.append(t.text)
break
else:
names.append('-')
for t in i.select('.lister-item-year'):
years.append(t.text)
break
else:
years.append('-')
for t in i.select('.ratings-imdb-rating'):
ratings.append(t.text.strip())
break
else:
ratings.append('-')
for t in i.select('.metascore'):
metascores.append(t.text.strip())
break
else:
metascores.append('-')
for t in i.select('.sort-num_votes-visible span:contains("Votes:") + span[data-value]'):
votes.append(t['data-value'])
break
else:
votes.append('-')
for t in i.select('.sort-num_votes-visible span:contains("Gross:") + span[data-value]'):
grossing.append(t['data-value'])
break
else:
grossing.append('-')
for t in i.select('.certificate'):
certificates.append(t.text.strip())
break
else:
certificates.append('-')
for t in i.select('.runtime'):
runtimes.append(t.text.strip())
break
else:
runtimes.append('-')
for t in i.select('.genre'):
genres.append(t.text.strip().split(','))
break
else:
genres.append('-')
for t in i.select('p.text-muted')[1:2]:
descriptions.append(t.text.strip())
break
else:
descriptions.append('-')
for row in zip(names, years, ratings, metascores, votes, grossing, certificates, runtimes, genres, descriptions):
for col_num, data in enumerate(row):
if col_num == 0:
t = textwrap.shorten(str(data), 35)
print('{: ^35}'.format(t), end='|')
elif col_num in (1, 2, 3, 4, 5, 6, 7):
t = textwrap.shorten(str(data), 12)
print('{: ^12}'.format(t), end='|')
else:
t = textwrap.shorten(str(data), 35)
print('{: ^35}'.format(t), end='|')
print()
打印:
Toy Story 4 | (2019) | 8.3 | 84 | 50496 |272,257,544 | G | 100 min |['Animation', ' Adventure', ' [...]|When a new toy called "Forky" [...]|
Charlie's Angels | (2019) | - | - | - | - | - | - |['Action', ' Adventure', ' Comedy']| Reboot of the 2000 action [...] |
Murder Mystery | (2019) | 6.0 | 38 | 46255 | - | PG-13 | 97 min | ['Action', ' Comedy', ' Crime'] | A New York cop and his wife [...] |
Eile veel |(III) (2019)| 7.1 | 56 | 10539 | 26,132,740 | PG-13 | 116 min | ['Comedy', ' Fantasy', ' Music'] | A struggling musician [...] |
Mehed mustas: globaalne oht | (2019) | 5.7 | 38 | 24338 | 66,894,949 | PG-13 | 114 min |['Action', ' Adventure', ' Comedy']|The Men in Black have always [...] |
Good Omens | (2019) | 8.3 | - | 24804 | - | - | 60 min | ['Comedy', ' Fantasy'] | A tale of the bungling of [...] |
Ükskord Hollywoodis | (2019) | 9.6 | 88 | 6936 | - | - | 159 min | ['Comedy', ' Drama'] |A faded television actor and [...] |
Aladdin | (2019) | 7.4 | 53 | 77230 |313,189,616 | PG | 128 min |['Adventure', ' Comedy', ' Family']|A kind-hearted street urchin [...] |
Mr. Iglesias | (2019– ) | 7.2 | - | 2266 | - | - | 30 min | ['Comedy'] | A good-natured high school [...] |
Shazam! | (2019) | 7.3 | 70 | 129241 |140,105,000 | PG-13 | 132 min |['Action', ' Adventure', ' Comedy']| We all have a superhero [...] |
Shaft | (2019) | 6.4 | 40 | 12016 | 19,019,975 | R | 111 min | ['Action', ' Comedy', ' Crime'] | John Shaft Jr., a cyber [...] |
Kontor |(2005–2013) | 8.8 | - | 301620 | - | - | 22 min | ['Comedy'] |A mockumentary on a group of [...] |
Sõbrad |(1994–2004) | 8.9 | - | 683205 | - | - | 22 min | ['Comedy', ' Romance'] | Follows the personal and [...] |
Lelulugu | (1995) | 8.3 | 95 | 800957 |191,796,233 | - | 81 min |['Animation', ' Adventure', ' [...]| A cowboy doll is profoundly [...] |
Lelulugu 3 | (2010) | 8.3 | 92 | 689098 |415,004,880 | - | 103 min |['Animation', ' Adventure', ' [...]| The toys are mistakenly [...] |
Orange Is the New Black | (2013– ) | 8.1 | - | 256417 | - | - | 59 min | ['Comedy', ' Crime', ' Drama'] | Convicted of a decade old [...] |
Brooklyn Nine-Nine | (2013– ) | 8.4 | - | 154342 | - | - | 22 min | ['Comedy', ' Crime'] | Jake Peralta, an immature, [...] |
Always Be My Maybe | (2019) | 6.9 | 64 | 26210 | - | PG-13 | 101 min | ['Comedy', ' Romance'] | A pair of childhood friends [...] |
The Dead Don't Die | (2019) | 6.0 | 54 | 6841 | 6,116,830 | R | 104 min | ['Comedy', ' Fantasy', ' Horror'] | The peaceful town of [...] |
Suure Paugu teooria |(2007–2019) | 8.2 | - | 653122 | - | - | 22 min | ['Comedy', ' Romance'] | A woman who moves into an [...] |
Lelulugu 2 | (1999) | 7.9 | 88 | 476104 |245,852,179 | - | 92 min |['Animation', ' Adventure', ' [...]|When Woody is stolen by a toy [...]|
Fast & Furious Presents: [...] | (2019) | - | - | - | - | PG-13 | - |['Action', ' Adventure', ' Comedy']|Lawman Luke Hobbs and outcast [...]|
Dead to Me | (2019– ) | 8.2 | - | 23149 | - | - | 30 min | ['Comedy', ' Drama'] | A series about a powerful [...] |
Pintsaklipslased | (2011– ) | 8.5 | - | 328568 | - | - | 44 min | ['Comedy', ' Drama'] | On the run from a drug deal [...] |
The Secret Life of Pets 2 | (2019) | 6.6 | 55 | 8613 |135,983,335 | PG | 86 min |['Animation', ' Adventure', ' [...]| Continuing the story of Max [...] |
Good Girls | (2018– ) | 7.9 | - | 18518 | - | - | 43 min | ['Comedy', ' Crime', ' Drama'] | Three suburban mothers [...] |
Ralph Breaks the Internet | (2018) | 7.1 | 71 | 91165 |201,091,711 | PG | 112 min |['Animation', ' Adventure', ' [...]|Six years after the events of [...]|
Trolls 2 | (2020) | - | - | - | - | - | - |['Animation', ' Adventure', ' [...]| Sequel to the 2016 animated hit. |
Booksmart | (2019) | 7.4 | 84 | 24935 | 21,474,121 | R | 102 min | ['Comedy'] | On the eve of their high [...] |
The Old Man & the Gun | (2018) | 6.8 | 80 | 27337 | 11,277,120 | PG-13 | 93 min |['Biography', ' Comedy', ' Crime'] | Based on the true story of [...] |
Fleabag | (2016– ) | 8.6 | - | 25041 | - | - | 27 min | ['Comedy', ' Drama'] |A comedy series adapted from [...] |
Schitt's Creek | (2015– ) | 8.2 | - | 18112 | - | - | 22 min | ['Comedy'] |When rich video-store magnate [...]|
Catch-22 | (2019– ) | 7.9 | - | 6829 | - | - | 45 min | ['Comedy', ' Crime', ' Drama'] |Limited series adaptation of [...] |
Häbitu | (2011– ) | 8.7 | - | 171782 | - | - | 46 min | ['Comedy', ' Drama'] | A scrappy, fiercely loyal [...] |
Jane the Virgin | (2014– ) | 7.8 | - | 30106 | - | - | 60 min | ['Comedy'] | A young, devout Catholic [...] |
Parks and Recreation |(2009–2015) | 8.6 | - | 178220 | - | - | 22 min | ['Comedy'] | The absurd antics of an [...] |
One Punch Man: Wanpanman | (2015– ) | 8.9 | - | 87166 | - | - | 24 min | ['Animation', ' Action', ' [...] |The story of Saitama, a hero [...] |
The Boys | (2019– ) | - | - | - | - | - | 60 min | ['Action', ' Comedy', ' Crime'] |A group of vigilantes set out [...]|
Pokémon Detective Pikachu | (2019) | 6.8 | 53 | 65217 |142,692,000 | PG | 104 min |['Action', ' Adventure', ' Comedy']| In a world where people [...] |
Kuidas ma kohtasin teie ema |(2005–2014) | 8.3 | - | 544472 | - | - | 22 min | ['Comedy', ' Romance'] | A father recounts to his [...] |
It's Always Sunny in Philadelphia | (2005– ) | 8.7 | - | 171517 | - | - | 22 min | ['Comedy'] | Five friends with big egos [...] |
Stuber | (2019) | 5.9 | 53 | 794 | - | R | 93 min | ['Action', ' Comedy'] |A detective recruits his Uber [...]|
Moodne perekond | (2009– ) | 8.4 | - | 314178 | - | - | 22 min | ['Comedy', ' Romance'] | Three different but related [...] |
The Umbrella Academy | (2019– ) | 8.1 | - | 73654 | - | - | 60 min |['Action', ' Adventure', ' Comedy']| A disbanded group of [...] |
Happy! |(2017–2019) | 8.3 | - | 25284 | - | - | 60 min | ['Action', ' Comedy', ' Crime'] | An injured hitman befriends [...] |
Rick and Morty | (2013– ) | 9.3 | - | 279411 | - | - | 23 min |['Animation', ' Adventure', ' [...]| An animated series that [...] |
Cobra Kai | (2018– ) | 8.8 | - | 34069 | - | - | 30 min | ['Action', ' Comedy', ' Drama'] |Decades after their 1984 All [...] |
Roheline raamat | (2018) | 8.2 | 69 | 215069 | 85,080,171 | PG-13 | 130 min |['Biography', ' Comedy', ' Drama'] | A working-class Italian- [...] |
Kondid |(2005–2017) | 7.9 | - | 130232 | - | - | 40 min | ['Comedy', ' Crime', ' Drama'] | Forensic anthropologist Dr. [...] |
Sex Education | (2019– ) | 8.4 | - | 68509 | - | - | 45 min | ['Comedy', ' Drama'] | A teenage boy with a sex [...] |