HTML scraping of the IMDb top box office list with Python

Time: 2014-05-30 06:46:04

Tags: python html

  

URL: http://www.imdb.com/chart/?ref_=nv_ch_cht_2

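I would like to print the top box office list from the site above: for every movie, its rank, title, weekend gross, total gross, and weeks in release, in that order.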

Example output:
Rank: 1
Title: godzilla
Weekend: $93.2M
Gross: $93.2M
Weeks: 1
Rank: 2
Title: neighbors

3 answers:

Answer 0: (score: 1)

This is just a simple way to extract those entities with BeautifulSoup:
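from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen

url = "http://www.imdb.com/chart/?ref_=nv_ch_cht_2"

# Download the chart page and parse it
html = urlopen(url).read()
page = BeautifulSoup(html, 'html.parser')

# Select the table rows whose class is "odd" or "even"
rows = page.find_all("tr", {'class': ['odd', 'even']})

for tr in rows:
    # Print the text of the title, rating, and weeks cells of each row
    for cell in tr.find_all("td", {'class': ['titleColumn', 'weeksColumn', 'ratingColumn']}):
        print(cell.get_text())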

P.S. Arrange the output however you like.
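One way to arrange the printed cells into the layout the question asks for is sketched below. It is only a sketch: the helper name format_movie is hypothetical, it assumes each row yields four cell texts in page order (title, weekend, gross, weeks), and the figures in the second sample row are made up for illustration.

```python
def format_movie(rank, cells):
    """Format one chart row as labelled lines.

    `cells` is assumed to hold four strings in page order:
    title, weekend gross, total gross, weeks in release.
    """
    # Collapse the whitespace and newlines that get_text() leaves behind
    title, weekend, gross, weeks = (" ".join(c.split()) for c in cells)
    return "\n".join([
        "Rank: %d" % rank,
        "Title: " + title,
        "Weekend: " + weekend,
        "Gross: " + gross,
        "Weeks: " + weeks,
    ])


# Hypothetical sample data standing in for the scraped cell texts;
# the second row's figures are invented for this example.
sample_rows = [
    ["\n Godzilla\n (2014) ", " $93.2M ", " $93.2M ", "1"],
    ["\n Neighbors\n (2014) ", " $26.0M ", " $91.5M ", "2"],
]

report = "\n\n".join(
    format_movie(rank, cells)
    for rank, cells in enumerate(sample_rows, start=1)
)
print(report)
```

To feed it from the scraper, collect the get_text() result of each cell in a row into a list and pass that list, together with a running rank, to format_movie.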

Answer 1: (score: 0)

There is no need to scrape anything. Take a look at the answer I gave here:

How to scrape data from imdb business page?

Answer 2: (score: 0)

The following Python script gives you (1) the list of top box office movies from IMDb and (2) the cast list for each of them.

from lxml.html import parse  # cssselect() also needs the "cssselect" package


def imdb_bo(no_of_movies=5):
    """Return a dict of top box office movies, keyed by 0-based position."""
    bo_url = 'http://www.imdb.com/chart/'
    bo_page = parse(bo_url).getroot()
    bo_table = bo_page.cssselect('table.chart')
    # Number of entries under the table's third child
    bo_total = len(bo_table[0][2])

    # Never ask for more movies than the chart lists
    count = min(no_of_movies, bo_total)

    # Select each column once instead of on every iteration
    title_cells = bo_page.cssselect('td.titleColumn')
    rating_cells = bo_page.cssselect('td.ratingColumn')
    weeks_cells = bo_page.cssselect('td.weeksColumn')

    movies = {}

    for i in range(count):
        mo = {}
        mo['url'] = 'http://www.imdb.com' + title_cells[i][0].get('href')
        mo['title'] = title_cells[i][0].text_content().strip()
        mo['year'] = title_cells[i][1].text_content().strip(" ()")
        # Each movie has two ratingColumn cells: weekend first, then gross
        mo['weekend'] = rating_cells[i * 2].text_content().strip()
        mo['gross'] = rating_cells[i * 2 + 1][0].text_content().strip()
        mo['weeks'] = weeks_cells[i].text_content().strip()

        # Fetch the movie's own page to read its cast table
        m_page = parse(mo['url']).getroot()
        m_casttable = m_page.cssselect('table.cast_list')

        mo['cast'] = []
        # The first row of the cast table is skipped; every later row
        # is read as one cast member
        for cast in m_casttable[0][1:]:
            m_starname = cast[1][0][0].text_content().strip()
            mo['cast'].append(m_starname)

        movies[i] = mo

    return movies


if __name__ == '__main__':

    no_of_movies = input("Enter no. of Box office movies to display:")
    bo_movies = imdb_bo(int(no_of_movies))

    for k, v in bo_movies.items():
        print('#' + str(k + 1) + '  ' + v['title'] + ' (' + v['year'] + ')')
        print('URL: ' + v['url'])
        print('Weekend: ' + v['weekend'])
        print('Gross: ' + v['gross'])
        print('Weeks: ' + v['weeks'])
        print('Cast: ' + ', '.join(v['cast']))
        print('\n')


Output (run in a terminal):

parag@parag-innovate:~/python$ python imdb_bo_scraper.py 
Enter no. of Box office movies to display:3
#1  Cinderella (2015)
URL: http://www.imdb.com/title/tt1661199?ref_=cht_bo_1
Weekend: $67.88M
Gross: $67.88M
Weeks: 1
Cast: Cate Blanchett, Lily James, Richard Madden, Helena Bonham Carter, Nonso Anozie, Stellan Skarsgård, Sophie McShera, Holliday Grainger, Derek Jacobi, Ben Chaplin, Hayley Atwell, Rob Brydon, Jana Perez, Alex Macqueen, Tom Edden


#2  Run All Night (2015)
URL: http://www.imdb.com/title/tt2199571?ref_=cht_bo_2
Weekend: $11.01M
Gross: $11.01M
Weeks: 1
Cast: Liam Neeson, Ed Harris, Joel Kinnaman, Boyd Holbrook, Bruce McGill, Genesis Rodriguez, Vincent D'Onofrio, Lois Smith, Common, Beau Knapp, Patricia Kalember, Daniel Stewart Sherman, James Martinez, Radivoje Bukvic, Tony Naumovski


#3  Kingsman: The Secret Service (2014)
URL: http://www.imdb.com/title/tt2802144?ref_=cht_bo_3
Weekend: $6.21M
Gross: $107.39M
Weeks: 5
Cast: Adrian Quinton, Colin Firth, Mark Strong, Jonno Davies, Jack Davenport, Alex Nikolov, Samantha Womack, Mark Hamill, Velibor Topic, Sofia Boutella, Samuel L. Jackson, Michael Caine, Taron Egerton, Geoff Bell, Jordan Long
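
The script in this answer reads the weekend and gross figures from one flat list of td.ratingColumn cells, two per movie, which is why it indexes them with i*2 and (i*2)+1. A minimal sketch of that pairing in isolation, using the figures from the output above; the helper name pair_rating_cells is hypothetical and not part of the answer's script:

```python
def pair_rating_cells(cell_texts):
    """Group a flat list of rating-cell texts into (weekend, gross) pairs.

    Assumes the cells arrive in page order, two per movie:
    weekend first, then gross.
    """
    if len(cell_texts) % 2 != 0:
        raise ValueError("expected two rating cells per movie")
    return [
        (cell_texts[i * 2], cell_texts[i * 2 + 1])
        for i in range(len(cell_texts) // 2)
    ]


# Cell texts for the three movies shown in the output above
cells = ["$67.88M", "$67.88M", "$11.01M", "$11.01M", "$6.21M", "$107.39M"]

for rank, (weekend, gross) in enumerate(pair_rating_cells(cells), start=1):
    print("#%d  Weekend: %s  Gross: %s" % (rank, weekend, gross))
```

Pairing the cells up front keeps the index arithmetic in one place, so a change in the number of rating cells per row only needs to be handled once.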