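I want to scrape the directors and actors from an IMDb page that lists the top 50 films of 2018.
My problem is that I don't know how to scrape them, because the enclosing element's class has no name.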
'''part of my code which is working fine'''
import requests
from bs4 import BeautifulSoup
response = requests.get('https://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page=1')
soup = BeautifulSoup(response.text, 'html.parser')
for i in soup.find_all('div', class_='lister-item-content'):
    # Each lister-item-content div holds the details of one film.
    film_length = i.find('span', class_='runtime').text
    film_genre = i.find('span', class_='genre').text
    public_rating = i.find('div', class_='ratings-bar').strong.text
'''part of the HTML code that I don't know how to work with'''
</p>, <p class="">
Directors:
<a href="/name/nm0751577/">Anthony Russo</a>,
<a href="/name/nm0751648/">Joe Russo</a>
<span class="ghost">|</span>
Stars:
<a href="/name/nm0000375/">Robert Downey Jr.</a>,
<a href="/name/nm1165110/">Chris Hemsworth</a>,
<a href="/name/nm0749263/">Mark Ruffalo</a>,
<a href="/name/nm0262635/">Chris Evans</a>
</p>]
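I would like to be able to scrape all the directors and all the listed actors for each film, and I want to do it from the single URL given in the code.
Answer 0 (score: 1)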
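You can use :contains and specify Director: or Directors: to target the right block for each film. Then separate out the directors by grabbing the a tags that come before the span tag (that is, by filtering out the ones that come after it). The stars are the a tags that are general siblings of the span tag. Requires bs4 4.7.1.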
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page=1')
soup = bs(r.content, 'lxml')
for item in soup.select('p:contains("Director:"), p:contains("Directors:")'):
    #print(item)
    directors = [d.text for d in item.select('a:not(span ~ a)')]
    actors = [d.text for d in item.select('span ~ a')]
    print(directors, actors)
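The split-at-the-span idea can be checked offline against the HTML fragment from the question. The following is only a minimal sketch: it swaps the CSS selectors for the standard library's html.parser so that it runs without a network connection or bs4, and CreditSplitter is a hypothetical helper name, not part of any library.

```python
from html.parser import HTMLParser

# Sample markup copied from the question (one film's credits paragraph).
SAMPLE = """<p class="">
Directors:
<a href="/name/nm0751577/">Anthony Russo</a>,
<a href="/name/nm0751648/">Joe Russo</a>
<span class="ghost">|</span>
Stars:
<a href="/name/nm0000375/">Robert Downey Jr.</a>,
<a href="/name/nm1165110/">Chris Hemsworth</a>,
<a href="/name/nm0749263/">Mark Ruffalo</a>,
<a href="/name/nm0262635/">Chris Evans</a>
</p>"""


class CreditSplitter(HTMLParser):
    """Hypothetical helper: links before the <span> are directors, links after are stars."""

    def __init__(self):
        super().__init__()
        self.directors, self.actors = [], []
        self._seen_span = False  # flips once the separator span is reached
        self._in_link = False    # True while inside an <a>...</a>

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._seen_span = True
        elif tag == "a":
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        # Only text inside a link is a name; route it by position relative to the span.
        if self._in_link:
            target = self.actors if self._seen_span else self.directors
            target.append(data.strip())


splitter = CreditSplitter()
splitter.feed(SAMPLE)
splitter.close()
print(splitter.directors)
print(splitter.actors)
```

Like the selectors above, this sketch assumes a director block precedes the separator; a paragraph with no span would put every name in the directors list.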
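Answer 1 (score: 1)
QHarr's answer is good, but I then noticed that some films have no director listed at all, and in that case the code skips those films entirely. So I updated QHarr's code to take that case into account: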
'''
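# Paragraphs that contain a director credit, selected once up front.
with_director = soup.select('p:contains("Director:"), p:contains("Directors:")')
reqs = 0  # assumed starting value; the counter is defined outside this snippet
for item in soup.select('p:contains("Stars:")'):
    reqs += 1
    if item not in with_director:
        # No director listed, so every link in the paragraph is an actor.
        actors = [d.text for d in item.select('a:not(span ~ a)')]
        directors = ['none']
    else:
        # Links before the span are directors, joined into one comma-separated string.
        directors = str([d.text for d in item.select('a:not(span ~ a)')]).strip('[]').replace("'", "")
        actors = [d.text for d in item.select('span ~ a')]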
'''
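An alternative that avoids the two-branch check is to work on the paragraph's text and read the Director(s): and Stars: labels directly. This is a sketch of a different technique, not what either answer does: parse_credits is a hypothetical helper, and it assumes the text has the "label: names | label: names" shape shown in the question (for example, what item.get_text() would return) and that no name contains a comma.

```python
def parse_credits(text):
    """Hypothetical helper: split 'Directors: ... | Stars: ...' text into two lists."""
    directors, actors = [], []
    # The "|" inside the ghost span separates the director part from the stars part.
    for part in text.split("|"):
        label, _, names = part.partition(":")
        label = label.strip().lower()
        people = [name.strip() for name in names.split(",") if name.strip()]
        if label in ("director", "directors"):
            directors = people
        elif label in ("star", "stars"):
            actors = people
    return directors, actors


# Text shaped like the question's paragraph, with and without a director block.
with_director = "\nDirectors:\nAnthony Russo,\nJoe Russo\n|\nStars:\nRobert Downey Jr.,\nChris Hemsworth\n"
without_director = "\nStars:\nRobert Downey Jr.,\nChris Hemsworth\n"

print(parse_credits(with_director))
print(parse_credits(without_director))
```

A film with no director yields an empty directors list rather than ['none'], and both values keep the same type in either case.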