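Is it possible to get the IMDb IDs of all titles that match a set of search criteria (for example number of votes, language, release year)?
My first priority is to compile a list of all IMDb IDs that are classified as feature films and have more than 25,000 votes (that is, the titles eligible to appear in the Top 250 list), as shown here. At the time of posting, 4,296 films meet these criteria.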
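(If you are not familiar with IMDb IDs: an IMDb ID is a unique 7-digit code associated with every movie, person, character, etc. in the database. For example, for the movie "Drive" (2011), the IMDb ID is "0780504".)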
In the future, however, it would be helpful to set the search criteria however I see fit, just as I can when typing in the URL (&num_votes=##, &year=##, &title_type=##, ...).
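As a rough illustration of setting criteria through URL parameters, here is a minimal sketch that assembles such a search URL with the standard library. The base URL and the helper name `build_search_url` are assumptions made for this example; only the parameter names `num_votes` and `title_type` come from the question.

```python
from urllib.parse import urlencode

# Assumed base URL of the title search page (not confirmed by the question).
SEARCH_URL = "http://www.imdb.com/search/title"


def build_search_url(**criteria):
    """Return a search URL with the given criteria as query parameters.

    Hypothetical helper: it only formats a URL and does not contact IMDb.
    """
    # safe="," keeps a range value such as "25000," unescaped in the URL.
    return SEARCH_URL + "?" + urlencode(criteria, safe=",")


url = build_search_url(num_votes="25000,", title_type="feature", page=1)
print(url)
```

Further criteria would be passed as extra keyword arguments, assuming the site accepts them under those names.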
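I have been using IMDbPY with great success to pull information on individual titles, and I would love it if the search feature I describe were accessible through that library.
So far I have been generating random 7-digit strings and testing whether they meet my criteria, but this is inefficient because I waste processing time on superfluous IDs: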
from imdb import IMDb, IMDbError
import random

i = IMDb(accessSystem='http')

# Draw 11,000 random, zero-padded 7-digit candidate IDs
movies = []
for _ in range(11000):
    randID = str(random.randint(0, 7221897)).zfill(7)
    movies.append(randID)

for m in movies:
    try:
        movie = i.get_movie(m)
    except IMDbError as err:
        print(err)
        continue  # skip this ID; otherwise `movie` would be unset or stale
    if str(movie) == '':
        continue
    kind = movie.get('kind')
    if kind != 'movie':
        continue
    votes = movie.get('votes')
    if votes is None:
        continue
    if votes >= 25000:
        ...  # the rest of the snippet is missing from the post
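To put the inefficiency of random sampling into numbers, the following back-of-the-envelope sketch uses only the figures quoted in the question. It assumes the 4,296 qualifying films are spread over the whole sampled ID range, which is a simplification.

```python
# Figures taken from the question.
qualifying = 4296          # films with more than 25,000 votes
id_space = 7221897 + 1     # random.randint(0, 7221897) is inclusive on both ends
samples = 11000            # number of random IDs tested

# Probability that one random ID is a qualifying film (simplifying assumption:
# qualifying IDs are spread uniformly over the sampled range).
hit_rate = qualifying / id_space
expected_hits = samples * hit_rate

print(f"hit rate per lookup: {hit_rate:.4%}")
print(f"expected matches in {samples} lookups: {expected_hits:.1f}")
```

Under that assumption each lookup succeeds about 0.06% of the time, so 11,000 requests would be expected to yield only six or seven matching films.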
Answer 0 (score: 1)
Take a look at http://www.omdbapi.com/. You can use the API directly to search by title or by ID.
In Python 3:
import urllib.request
urllib.request.urlopen("http://www.omdbapi.com/?apikey=27939b55&s=moana").read()
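The call above returns raw bytes. A minimal sketch of pulling IMDb IDs out of such a response follows; it runs on a made-up sample rather than a live request, and the field names (`Search`, `imdbID`, `Type`, `Response`) are my assumption about the JSON layout, so check them against a real response. The helper name is hypothetical.

```python
import json

# Hypothetical sample shaped like a search response; the values are made up.
sample_body = b"""{"Search": [
  {"Title": "Example One", "Year": "2016", "imdbID": "tt0000001", "Type": "movie"},
  {"Title": "Example Two", "Year": "2009", "imdbID": "tt0000002", "Type": "series"}
], "totalResults": "2", "Response": "True"}"""


def imdb_ids_from_search(body, wanted_type="movie"):
    """Return the IDs of results whose Type equals wanted_type."""
    data = json.loads(body)
    # Assumed convention: a failed search reports Response != "True".
    if data.get("Response") != "True":
        return []
    return [item["imdbID"] for item in data.get("Search", [])
            if item.get("Type") == wanted_type]


print(imdb_ids_from_search(sample_body))
```

Note that this searches by title text; it does not by itself filter on vote counts, so it covers only part of what the question asks for.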
Answer 1 (score: 0)
Here is my code:
from requests import get
from bs4 import BeautifulSoup
import re
import math
from time import time, sleep
from random import randint
from IPython.display import clear_output
from warnings import warn

# Fetch the first results page to find out how many titles match
url = "http://www.imdb.com/search/title?num_votes=25000,&title_type=feature&view=simple&sort=num_votes,desc&page=1&ref_=adv_nxt"
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')

# The 'desc' element holds the total number of matching titles
num_films_text = html_soup.find_all('div', class_='desc')
num_films = re.search(r'of ([\d,]+) titles', str(num_films_text[0])).group(1)
num_films = int(num_films.replace(',', ''))
print(num_films)

# Each results page lists 50 titles
num_pages = math.ceil(num_films / 50)
print(num_pages)
ids = []
start_time = time()
requests = 0

# For every page in the interval
for page in range(1, num_pages + 1):
    # Make a get request
    url = "http://www.imdb.com/search/title?num_votes=25000,&title_type=feature&view=simple&sort=num_votes,desc&page=" + str(page) + "&ref_=adv_nxt"
    response = get(url)

    # Pause the loop
    sleep(randint(8, 15))

    # Monitor the requests
    requests += 1
    sleep(randint(1, 3))
    elapsed_time = time() - start_time
    print('Request: {}; Frequency: {} requests/s'.format(requests, requests / elapsed_time))
    clear_output(wait=True)

    # Throw a warning for non-200 status codes
    if response.status_code != 200:
        warn('Request: {}; Status code: {}'.format(requests, response.status_code))

    # Break the loop if the number of requests is greater than expected
    if requests > num_pages:
        warn('Number of requests was greater than expected.')
        break

    # Parse the content of the request with BeautifulSoup
    page_html = BeautifulSoup(response.text, 'html.parser')

    # Select all the 50 movie containers from a single page
    movie_containers = page_html.find_all('div', class_='lister-item mode-simple')

    # Scrape the ID from the first link in each container
    for container in movie_containers:
        imdb_id = re.search(r'tt(\d+)/', str(container.a)).group(1)
        ids.append(imdb_id)

print(ids)
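The two parsing steps in the code above (reading the total count and pulling IDs out of links) can be checked without hitting the site. This is a minimal sketch using only the standard library; the sample strings are made up and the function names are hypothetical, so the real page markup may differ.

```python
import math
import re

TITLE_ID = re.compile(r"/title/tt(\d+)/")
TOTAL = re.compile(r"of ([\d,]+) titles")


def count_pages(desc_text, per_page=50):
    """Return (total titles, number of result pages) parsed from the summary text."""
    total = int(TOTAL.search(desc_text).group(1).replace(",", ""))
    return total, math.ceil(total / per_page)


def extract_ids(html):
    """Return the numeric IDs found in /title/tt.../ links, first occurrence only."""
    # dict.fromkeys removes duplicates while preserving the original order.
    return list(dict.fromkeys(TITLE_ID.findall(html)))


# Made-up inputs that mimic the shapes the scraper expects.
sample_desc = "1-50 of 4,296 titles."
sample_html = """
<div class="lister-item mode-simple">
  <a href="/title/tt0780504/?ref_=adv_li_i">poster</a>
  <a href="/title/tt0780504/?ref_=adv_li_tt">Drive</a>
</div>
<div class="lister-item mode-simple">
  <a href="/title/tt0000001/">Placeholder</a>
</div>
"""

print(count_pages(sample_desc))
print(extract_ids(sample_html))
```

Keeping the parsing in small functions like these makes it easier to notice when the page layout changes, since a failed match shows up in one place instead of in the middle of a long, rate-limited loop.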