Question

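I am using IMDbPY together with the publicly available IMDb datasets (https://www.imdb.com/interfaces/) to build a custom dataset with pandas. The public datasets contain a lot of useful information but, as far as I can tell, no plot information. IMDbPY does provide plot summaries, synopses, and plot keywords, exposed as the 'plot', 'synopsis', and 'keywords' keys of the movie class/dictionary.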

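I can get the plot for an individual movie with an API call: ia.get_movie(movie_index[2:])['plot'][0]. I use [2:] because the first two characters of the index in the public dataset are 'tt', and [0] because a movie can have many plot summaries and I only want the first one IMDbPY returns.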

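However, to get 10,000 plot summaries I would need to make 10,000 API calls. At about 2.7 seconds per call (the rate I measured with tqdm), that is 27,000 seconds, or 7.5 hours. One solution is simply to let it run overnight. Are there any other solutions? Also, is there a better way than my current approach, which builds a dictionary with the movie index as the key (e.g. tt0111161 for "The Shawshank Redemption") and the plot as the value, and then converts that dictionary into a DataFrame? Any insight is appreciated. My code is below: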

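import pandas as pd
from imdb import IMDb
from tqdm import tqdm

ia = IMDb()

# movies_index holds IMDb IDs from the public dataset, e.g. 'tt0111161'
movie_dict = {}
for movie_index in tqdm(movies_index[0:10]):
    try:
        # drop the 'tt' prefix and keep the first plot summary
        movie_dict[movie_index] = ia.get_movie(movie_index[2:])['plot'][0]
    except Exception:
        # no plot available, or the request failed
        movie_dict[movie_index] = ''

plots = pd.DataFrame.from_dict(movie_dict, orient='index')
plots.rename(columns={0: 'plot'}, inplace=True)
plots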


             plot
tt0111161   Two imprisoned men bond over a number of years...
tt0468569   When the menace known as the Joker emerges fro...
tt1375666   A thief who steals corporate secrets through t...
tt0137523   An insomniac office worker and a devil-may-car...
tt0110912   The lives of two mob hitmen, a boxer, a gangst...
tt0109830   The presidencies of Kennedy and Johnson, the e...
tt0120737   A meek Hobbit from the Shire and eight compani...
tt0133093   A computer hacker learns from mysterious rebel...
tt0167260   Gandalf and Aragorn lead the World of Men agai...
tt0068646   The aging patriarch of an organized crime dyna...

Answer 1

First of all, consider that making so many queries in a short time may be against their terms of service: https://www.imdb.com/conditions

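That said, 10,000 queries to a major website will not cause any real problem, especially if you wait a few seconds between calls just to be nice. It will take longer, but that should not matter in your case. Again, see above: you have to respect the license.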
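As a rough illustration of waiting between calls, here is a minimal sketch. The function name fetch_plots and its parameters are hypothetical, not part of IMDbPY; the actual lookup is passed in as a callable so the pause logic stays separate from the library. With a 2-second pause on top of roughly 2.7 seconds per call, 10,000 movies would take about 47,000 seconds, or around 13 hours.

```python
import time
from typing import Callable, Dict, Iterable


def fetch_plots(
    movie_ids: Iterable[str],
    fetch_plot: Callable[[str], str],
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch one plot per IMDb ID, pausing between consecutive calls.

    fetch_plot is a caller-supplied (hypothetical) callable that takes the
    numeric part of the ID, without the 'tt' prefix, and returns a plot.
    """
    plots: Dict[str, str] = {}
    for i, movie_id in enumerate(movie_ids):
        if i > 0:
            sleep(delay_seconds)  # be polite: wait before every call after the first
        try:
            plots[movie_id] = fetch_plot(movie_id[2:])  # strip the 'tt' prefix
        except Exception:
            plots[movie_id] = ""  # missing plot or failed request
    return plots


# Possible use with IMDbPY, mirroring the call from the question:
#   from imdb import IMDb
#   ia = IMDb()
#   plots = fetch_plots(movies_index[:10], lambda n: ia.get_movie(n)["plot"][0])
```

The resulting dictionary can be turned into a one-column DataFrame with pd.Series(plots, name="plot").to_frame(), which avoids the separate rename step.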

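I can suggest two different options:

Use the old dataset, which is freely available for personal and non-commercial use and which IMDbPY is able to parse; the downside is that the data is somewhat outdated (it stops at the end of 2017): https://imdbpy.readthedocs.io/en/latest/usage/ptdf.html
Use a different source, such as https://www.omdbapi.com/ or https://www.themoviedb.org/, which should have public APIs and more permissive licenses.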
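For the second option, a minimal sketch of an OMDb lookup might look like the following. It assumes the request parameters (i, plot, apikey) and the JSON fields (Response, Plot) described in OMDb's public documentation; the helper names are hypothetical, you need your own API key, and you should check the service's current terms and rate limits before running 10,000 requests.

```python
import json
import urllib.parse
import urllib.request

OMDB_URL = "https://www.omdbapi.com/"


def build_omdb_url(imdb_id: str, api_key: str) -> str:
    """Build a lookup URL; OMDb takes the full ID, including the 'tt' prefix."""
    query = urllib.parse.urlencode(
        {"i": imdb_id, "plot": "short", "apikey": api_key}
    )
    return f"{OMDB_URL}?{query}"


def extract_plot(payload: dict) -> str:
    """Return the plot text, or '' when the lookup failed or no plot exists."""
    if payload.get("Response") != "True":
        return ""
    plot = payload.get("Plot", "")
    return "" if plot == "N/A" else plot


def get_plot(imdb_id: str, api_key: str, timeout: float = 10.0) -> str:
    """Fetch and parse one movie record (performs a network request)."""
    url = build_omdb_url(imdb_id, api_key)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return extract_plot(json.load(response))


# Example (needs a real key):
#   get_plot("tt0111161", "YOUR_API_KEY")
```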

Disclaimer: I am one of the main authors of IMDbPY.

Getting 10,000 movie plots with IMDbPY