StopIteration error while web scraping an IMDb page?

Time: 2015-03-07 08:01:04

Tags: python web-scraping beautifulsoup

I am trying to scrape the names of the 10 celebrities shown on this page: url = 'http://m.imdb.com/feature/bornondate'. But Python raises StopIteration and does not print my results.

Here is my code; I think it explains what I am trying to do.

import urllib2
from bs4 import BeautifulSoup

url = 'http://m.imdb.com/feature/bornondate'

test_url = urllib2.urlopen(url)
readHtml = test_url.read()
test_url.close()

soup = BeautifulSoup(readHtml)
# Used to track the number of celebrities
count = 0
# Fetching the value present within tag results
celebrities = soup.findChildren('section', 'posters list')
# Changing the celebrity into an iterator
itercelebrity = iter(celebrities[0].findChildren('a'))
# Skipping the first value of the iterator as it does not have the required info
next(itercelebrity)

# Finding a in itercelebrity. Every a tag contains information of a celebrity
for a in itercelebrity:

    celebrity = a.findChildren('div', 'label')
    name = celebrity[0].find('span', 'title').contents[0]

    print '*******************************IMDB CELEBRITYS***********************************'
    # Printing the Name of the celebrity
    print 'Name --> ' + name

Here is the output (it does not print anything):

Patricks-MacBook-Pro:~ Patrick$ python /Users/Patrick/Desktop/IMDB_BornToday_Scraping.py
Traceback (most recent call last):
  File "/Users/Patrick/Desktop/IMDB_BornToday_Scraping.py", line 20, in <module>
    next(itercelebrity)
StopIteration
Patricks-MacBook-Pro:~ Patrick$ 

If you can't tell by now, I am very new to this :) Here is the relevant HTML I am after:

<section class="posters list">
<h1>March 7</h1>

<a href="/name/nm0186505/" class="poster "><img src="http://ia.media-imdb.com/images/M/MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1._CR0,0,1369,2019_SX40_SY59.jpg" style="background:url('http://i.media-imdb.com/images/mobile/people-40x59-fade.png')" width="40" height="59"><div class="label"><span class="title">Bryan Cranston</span><div class="detail">Actor, "Ozymandias"</div></div></a><a href="/name/nm0696059/" class="poster "><img src="http://ia.media-imdb.com/images/M/MV5BNjUxNjcxMjE4N15BMl5BanBnXkFtZTgwNDk4NjA2MzE@._V1._CR156,0,1736,2560_SX40_SY59.jpg" style="background:url('http://i.media-imdb.com/images/mobile/people-40x59-fade.png')" width="40" height="59"><div class="label"><span class="title">Laura Prepon</span><div class="detail">Actress, "Karla"</div></div></a><a href="/name/nm0001838/" class="poster "><img src="http://ia.media-imdb.com/images/M/MV5BMTQ4MzM1MDAwMV5BMl5BanBnXkFtZTcwNTU4NzQwMw@@._V1._CR5,0,271,400_SX40_SY59.jpg" style="background:url('http://i.media-imdb.com/images/mobile/people-40x59-fade.png')" width="40" height="59"><div class="label"><span class="title">Rachel Weisz</span><div class="detail">Actress, "The Mummy"</div></div></a><a href="/name/nm0765597/" class="poster "><img src="http://ia.media-imdb.com/images/M/MV5BMjE0Mjg0NzE2Nl5BMl5BanBnXkFtZTcwMDE1MTkxMw@@._V1._CR19,0,271,400_SX40_SY59.jpg" style="background:url('http://i.media-imdb.com/images/mobile/people-40x59-fade.png')" width="40" height="59"><div class="label"><span class="title">Peter Sarsgaard</span><div class="detail">Actor, "Jarhead"</div></div></a><a href="/name/nm0278979/" class="poster "><img src="http://ia.media-imdb.com/images/M/MV5BMTMyOTYzODQ5MF5BMl5BanBnXkFtZTcwMjE3MDgzMQ@@._V1._CR24,0,271,400_SX40_SY59.jpg" style="background:url('http://i.media-imdb.com/images/mobile/people-40x59-fade.png')" width="40" height="59"><div class="label"><span class="title">Jenna Fischer</span><div class="detail">Actress, "Blades of Glory"</div></div></a><a 
href="/name/nm0614220/" class="poster "><img src="http://ia.media-imdb.com/images/M/MV5BMzE2OTAwNzM0Ml5BMl5BanBnXkFtZTcwNzE1MDg0Mw@@._V1._CR26,0,488,720_SX40_SY59.jpg" style="background:url('http://i.media-imdb.com/images/mobile/people-40x59-fade.png')" width="40" height="59"><div class="label"><span class="title">Donna Murphy</span><div class="detail">Actress, "Tangled"</div></div></a><a href="/name/nm0862328/" class="poster "><img src="http://ia.media-imdb.com/images/M/MV5BMTI0OTMzMzE0N15BMl5BanBnXkFtZTcwMjI1MzYyMQ@@._V1._CR33,0,235,346_SX40_SY59.jpg" style="background:url('http://i.media-imdb.com/images/mobile/people-40x59-fade.png')" width="40" height="59"><div class="label"><span class="title">T.J. Thyne</span><div class="detail">Actor, "How the Grinch Stole Christmas"</div></div></a><a href="/name/nm0001334/" class="poster "><img src="http://ia.media-imdb.com/images/M/MV5BNzczODkyNzY4OV5BMl5BanBnXkFtZTcwNTU0NjQzMQ@@._V1._CR41,0,368,543_SX40_SY59.jpg" style="background:url('http://i.media-imdb.com/images/mobile/people-40x59-fade.png')" width="40" height="59"><div class="label"><span class="title">John Heard</span><div class="detail">Actor, "Home Alone"</div></div></a><a href="/name/nm1017524/" class="poster "><img src="http://ia.media-imdb.com/images/M/MV5BMTg4MjU2MzA2OV5BMl5BanBnXkFtZTgwOTIxMjc4MjE@._V1._CR0,0,3644,5375_SX40_SY59.jpg" style="background:url('http://i.media-imdb.com/images/mobile/people-40x59-fade.png')" width="40" height="59"><div class="label"><span class="title">Audrey Marie Anderson</span><div class="detail">Actress, "Beerfest"</div></div></a><a href="/name/nm0891216/" class="poster "><img src="http://ia.media-imdb.com/images/M/MV5BMTQyOTc5NzA0M15BMl5BanBnXkFtZTYwODQ2MjYz._V1._CR0,0,266,392_SX40_SY59.jpg" style="background:url('http://i.media-imdb.com/images/mobile/people-40x59-fade.png')" width="40" height="59"><div class="label"><span class="title">Matthew Vaughn</span><div class="detail">Producer, "Kick-Ass"</div></div></a><div 
class="paginator"><a class="next" data-start="10" href="#page10">Show more...</a><a class="seeAll" href="#showAll">See all</a></div></section>

1 Answer:

Answer 0 (score: 1)

The error comes from:

celebrities[0].findChildren('a')

This call returns no results, so the iterator is empty, which is the same as doing:

it = iter([])
next(it)

which raises the same exception:

>>> it = iter([])
>>> next(it)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
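One way to keep an empty result from crashing the script, sketched here as a suggestion and not part of the original code: next() accepts a second argument that is returned instead of raising StopIteration when the iterator is exhausted. The helper name skip_first is hypothetical, and the example uses Python 3 syntax.

```python
def skip_first(items):
    """Return an iterator over items with the first element skipped.

    Hypothetical helper: unlike a bare next(it), it does not raise
    StopIteration when items is empty.
    """
    it = iter(items)
    # The second argument to next() is a default that is returned
    # instead of raising StopIteration on an exhausted iterator.
    next(it, None)
    return it


print(list(skip_first([])))                  # → []
print(list(skip_first(["h1", "a1", "a2"])))  # → ['a1', 'a2']
```

This only hides the symptom: an empty list still means the expected elements were not found in the parsed document.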

A better approach is to use CSS selectors with the soup.select() method. This will print all the names:

for name in soup.select("section.posters.list a.poster div.label span.title"):
    print name.string

The selector may be more specific than necessary.
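As a self-contained illustration of the selector approach, here is a minimal sketch that runs select() against a trimmed copy of the markup posted in the question, with no network access. It uses Python 3 syntax and the standard "html.parser" backend; the names SAMPLE_HTML and extract_names are made up for this example, and the selector is a slightly shortened variant of the one above.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (two of the ten entries,
# with the <img> tags removed for brevity).
SAMPLE_HTML = '''
<section class="posters list">
<h1>March 7</h1>
<a href="/name/nm0186505/" class="poster "><div class="label"><span class="title">Bryan Cranston</span><div class="detail">Actor, "Ozymandias"</div></div></a>
<a href="/name/nm0696059/" class="poster "><div class="label"><span class="title">Laura Prepon</span><div class="detail">Actress, "Karla"</div></div></a>
<div class="paginator"><a class="next" data-start="10" href="#page10">Show more...</a></div>
</section>
'''


def extract_names(html):
    """Return the text of every span.title inside a poster link."""
    soup = BeautifulSoup(html, "html.parser")
    # a.poster leaves out the paginator links, so no element has to be
    # skipped by hand, and an empty match is just an empty list.
    matches = soup.select("section.posters.list a.poster span.title")
    return [span.get_text() for span in matches]


print(extract_names(SAMPLE_HTML))
```

Because select() returns a plain list, a page with no matching elements yields an empty list and no exception.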


But this does not work either, and I have figured out why. Look at the HTML that is actually returned when the page is fetched:

<section class="posters list">
<h1>&nbsp;</h1>
<span class="loading"></span>
</section>

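The contents of the section cannot be extracted because they are not in the fetched HTML. They are loaded afterwards by an AJAX request, which is issued from this script:
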
<script language="javascript" type="text/javascript">
$(document).ready(function() {
    var pagination = $('section.posters').itemPagination(10)
    var now = new Date();

    var client = new IMDbClient();
    client.useSessionCache(true);
    client.call('/feature/bornondate_json?today='+now.toYYYYMMDD(), function(data) {
        pagination(data.list);
    });
    var months = ['January','February','March','April','May','June','July','August','September','October','November','December'];
    $('section.posters > h1').html( months[now.getMonth()] + ' ' + now.getDate() );
});
</script>
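The request made by that script can be approximated in Python. The sketch below (Python 3 syntax) rebuilds the URL and reads the "list" member that the script hands to the paginator. Several things here are assumptions: that toYYYYMMDD() yields eight digits such as 20150307, that the endpoint answers a plain GET without whatever IMDbClient adds, and the function names born_on_date_url and fetch_entries, which are hypothetical. Only the URL builder is exercised; fetch_entries has not been verified against the live site.

```python
import json
from datetime import date
from urllib.request import urlopen

BASE_URL = "http://m.imdb.com"  # host taken from the URL in the question


def born_on_date_url(day):
    """Build the URL that the page's script requests for the given date.

    Assumption: toYYYYMMDD() in the page's JavaScript produces eight
    digits with no separators, e.g. 20150307.
    """
    return BASE_URL + "/feature/bornondate_json?today=" + day.strftime("%Y%m%d")


def fetch_entries(day):
    """Fetch the JSON document and return its "list" member."""
    with urlopen(born_on_date_url(day)) as response:
        data = json.loads(response.read().decode("utf-8"))
    # The script passes data.list to the paginator, so "list" is the only
    # key known from the source; the shape of each entry is not shown.
    return data["list"]


print(born_on_date_url(date(2015, 3, 7)))
# → http://m.imdb.com/feature/bornondate_json?today=20150307
```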

If you want to extract the data, your best option is to use a browser driver such as Selenium.
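A minimal sketch of what that could look like with Selenium's Python bindings, in Python 3 syntax. It assumes Firefox and its driver are installed, that the rendered page still uses the markup shown in the question, and that ten seconds is long enough for the AJAX call to finish. The function name scrape_names is hypothetical.

```python
def scrape_names(url, timeout=10):
    """Load url in a real browser and return the rendered celebrity names."""
    # Imported here so the function can be defined without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    selector = "section.posters.list a.poster span.title"
    driver = webdriver.Firefox()  # assumption: Firefox and its driver exist
    try:
        driver.get(url)
        # Block until the page's script has inserted at least one name.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector))
        )
        elements = driver.find_elements(By.CSS_SELECTOR, selector)
        return [element.text for element in elements]
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

Calling scrape_names('http://m.imdb.com/feature/bornondate') would open a browser window, wait for the list to render, and return the names as strings.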