How do I scrape data from the IMDb business page?

Time: 2014-12-27 08:06:24

Tags: python imdb

I am working on a project that needs data from the IMDb business page. I am using Python. The data sits between two tags, like this:

Budget

$220,000,000 (estimated)

I want the numeric amount, but so far I have had no success. Any suggestions?

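4 answers:

Answer 0 (score: 2)

Take a look at Beautiful Soup, a useful scraping library. If you look at the page source, "Budget" is inside an h4 element, and the value is the next node in the DOM. This may not be the best example, but it works for your case: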

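# Python 3; requires the beautifulsoup4 package
from urllib.request import urlopen
from bs4 import BeautifulSoup


page = urlopen('http://www.imdb.com/title/tt0118715/?ref_=fn_al_nm_1a')
soup = BeautifulSoup(page.read(), 'html.parser')
for h4 in soup.find_all('h4'):
    # The label is the h4 text; the value is the text node right after the h4
    if "Budget:" in h4.get_text():
        print(h4.next_sibling.strip())

# $15,000,000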
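The loop above prints the budget as text, while the question asks for the numeric amount. A minimal sketch of that last step, assuming the scraped string looks like the one in the question; parse_budget is a hypothetical helper name, not part of Beautiful Soup or any other library:

```python
import re


def parse_budget(text):
    """Return the amount in a string such as '$220,000,000 (estimated)' as an int.

    Hypothetical helper for illustration. Returns None when no digits are found.
    """
    # Grab the first run of digits, allowing commas as thousands separators
    match = re.search(r'\d[\d,]*', text)
    if match is None:
        return None
    return int(match.group().replace(',', ''))


print(parse_budget('$220,000,000 (estimated)'))  # 220000000
```

This ignores the currency symbol, so it only fits cases where the currency is already known.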

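Answer 1 (score: 1)

This is a fairly long piece of code (you can find what you need in it). The following Python script gives you 1) the list of top box-office movies from IMDb and 2) the cast list for each of them.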

from lxml.html import parse

def imdb_bo(no_of_movies=5):
    bo_url = 'http://www.imdb.com/chart/'
    bo_page = parse(bo_url).getroot()
    bo_table = bo_page.cssselect('table.chart')
    bo_total = len(bo_table[0][2])

    # Never ask for more rows than the chart actually contains
    count = min(no_of_movies, bo_total)

    movies = {}

    for i in range(0, count):
        mo = {}
        mo['url'] = 'http://www.imdb.com'+bo_page.cssselect('td.titleColumn')[i][0].get('href')
        mo['title'] = bo_page.cssselect('td.titleColumn')[i][0].text_content().strip()
        mo['year'] = bo_page.cssselect('td.titleColumn')[i][1].text_content().strip(" ()")
        mo['weekend'] = bo_page.cssselect('td.ratingColumn')[i*2].text_content().strip()
        mo['gross'] = bo_page.cssselect('td.ratingColumn')[(i*2)+1][0].text_content().strip()
        mo['weeks'] = bo_page.cssselect('td.weeksColumn')[i].text_content().strip()

        m_page = parse(mo['url']).getroot()
        m_casttable = m_page.cssselect('table.cast_list')

        flag = 0
        mo['cast'] = []
        for cast in m_casttable[0]:
            if flag == 0:
                flag = 1
            else:
                m_starname = cast[1][0][0].text_content().strip()
                mo['cast'].append(m_starname)

        movies[i] = mo

    return movies


if __name__ == '__main__':
    # Python 3: input() replaces raw_input() and print is a function
    no_of_movies = input("Enter no. of Box office movies to display:")
    bo_movies = imdb_bo(int(no_of_movies))

    for k, v in bo_movies.items():
        print('#' + str(k + 1) + '  ' + v['title'] + ' (' + v['year'] + ')')
        print('URL: ' + v['url'])
        print('Weekend: ' + v['weekend'])
        print('Gross: ' + v['gross'])
        print('Weeks: ' + v['weeks'])
        print('Cast: ' + ', '.join(v['cast']))
        print('\n')


Output (run in a terminal):

parag@parag-innovate:~/python$ python imdb_bo_scraper.py 
Enter no. of Box office movies to display:3
#1  Cinderella (2015)
URL: http://www.imdb.com/title/tt1661199?ref_=cht_bo_1
Weekend: $67.88M
Gross: $67.88M
Weeks: 1
Cast: Cate Blanchett, Lily James, Richard Madden, Helena Bonham Carter, Nonso Anozie, Stellan Skarsgård, Sophie McShera, Holliday Grainger, Derek Jacobi, Ben Chaplin, Hayley Atwell, Rob Brydon, Jana Perez, Alex Macqueen, Tom Edden


#2  Run All Night (2015)
URL: http://www.imdb.com/title/tt2199571?ref_=cht_bo_2
Weekend: $11.01M
Gross: $11.01M
Weeks: 1
Cast: Liam Neeson, Ed Harris, Joel Kinnaman, Boyd Holbrook, Bruce McGill, Genesis Rodriguez, Vincent D'Onofrio, Lois Smith, Common, Beau Knapp, Patricia Kalember, Daniel Stewart Sherman, James Martinez, Radivoje Bukvic, Tony Naumovski


#3  Kingsman: The Secret Service (2014)
URL: http://www.imdb.com/title/tt2802144?ref_=cht_bo_3
Weekend: $6.21M
Gross: $107.39M
Weeks: 5
Cast: Adrian Quinton, Colin Firth, Mark Strong, Jonno Davies, Jack Davenport, Alex Nikolov, Samantha Womack, Mark Hamill, Velibor Topic, Sofia Boutella, Samuel L. Jackson, Michael Caine, Taron Egerton, Geoff Bell, Jordan Long

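Answer 2 (score: 0)

You asked for Python, and you asked for a scraping solution.

But there is no need to use Python, or to scrape anything, because the budget data is available in the business.list text file at http://www.imdb.com/interfaces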
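If you still want the numbers in Python, reading that file could look roughly like the sketch below. The layout is an assumption on my part (an "MV:" line starting each title and a "BT:" line carrying its budget), the sample entries are made up, and budgets is a hypothetical helper name, so check all of this against the real file before relying on it:

```python
import re

# Made-up sample in the layout business.list is ASSUMED to use:
# "MV:" starts a title entry and "BT:" carries its budget.
SAMPLE = """\
-------------------------------------------------------------------------------
MV: Example Movie (1998)

BT: USD 15,000,000
GR: USD 17,000,000 (USA)
-------------------------------------------------------------------------------
MV: Another Example (2013)

BT: USD 220,000,000
"""


def budgets(lines):
    """Map each title to (currency, amount), using its first BT: line."""
    result = {}
    title = None
    for line in lines:
        if line.startswith('MV: '):
            # Remember which title the following lines belong to
            title = line[4:].strip()
        elif line.startswith('BT: ') and title is not None:
            match = re.match(r'BT:\s+([A-Z]{3})\s+([\d,]+)', line)
            if match and title not in result:
                amount = int(match.group(2).replace(',', ''))
                result[title] = (match.group(1), amount)
    return result


print(budgets(SAMPLE.splitlines()))
```

For the real file you would pass an open file object instead of SAMPLE.splitlines(), since the function only needs an iterable of lines.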

Answer 3 (score: 0)

Try IMDbPY (see its documentation). To install, just run pip install imdbpy

from imdb import IMDb

ia = IMDb()
# Take the first search hit, then fetch its full record
movie = ia.search_movie('The Untouchables')[0]
ia.update(movie)

# Lots of info for the movie from IMDb
movie.keys()

Though I am not sure where to find the budget information specifically.