Question

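I'm following these tutorials, http://importpython.blogspot.com/2009/12/how-to-get-beautifulsoup-to-filter.html and http://importpython.blogspot.com/2009/12/how-to-screen-scrape-craigslist-using.html, and even with the code copied and pasted I can't get the link titles to print: I get a "list index out of range" error on lines 11 and 8 respectively. If I'm copying the code verbatim, what am I doing wrong? I tried other variations, such as returning only the links, and they work fine, so I don't think it's a local problem.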

Edit

The problem is the following code (from http://importpython.blogspot.com/2009/12/how-to-screen-scrape-craigslist-using.html):

from BeautifulSoup import BeautifulSoup   #1
from urllib2 import urlopen               #2

site = "http://sfbay.craigslist.org/rea/" #3
html = urlopen(site)                      #4
soup = BeautifulSoup(html)                #5
postings = soup('p')                      #6

for post in postings:                     #7
    print post('a')[0].contents[0]        #8
    print post('a')[0]['href']            #9

It gives this error:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
IndexError: list index out of range

Answer 1

This code depends on Craigslist's HTML structure, which has since changed. You'll get the "correct" result from the second 'a' tag:

print post('a')[1].contents[0]
print post('a')[1]['href']
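
The IndexError appears whenever a 'p' element holds fewer 'a' tags than the index asks for. A minimal sketch of guarding against that, assuming Python 3 with the bs4 package (BeautifulSoup 4) rather than the BeautifulSoup 3 import used in the question; SAMPLE_HTML and links_at are hypothetical names for illustration, and the markup is made up, not real Craigslist HTML:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only, not real Craigslist HTML.
SAMPLE_HTML = """
<p><a href="/img/1.html"></a><a href="/rea/1.html">Sunny studio</a></p>
<p>Paragraph without any links</p>
<p><a href="/about.html">About</a></p>
"""


def links_at(html, index):
    """Return (text, href) for the anchor at `index` in each <p> that has one."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for post in soup.find_all("p"):
        anchors = post.find_all("a")
        if len(anchors) <= index:
            # Indexing anchors[index] here would raise IndexError, so skip.
            continue
        anchor = anchors[index]
        found.append((anchor.get_text(strip=True), anchor["href"]))
    return found


print(links_at(SAMPLE_HTML, 1))
```

With this sample, only the first paragraph has a second anchor, so the other two are skipped instead of stopping the loop with an exception.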

Answer 2

BeautifulSoup is very powerful... so don't be lazy, and make full use of that power:

# find_all is the BeautifulSoup 4 name; BeautifulSoup 3 (as imported in the question) spells it findAll
soup = BeautifulSoup(html)
postings = soup.find_all('p', {'class': 'row'})

for post in postings:
    info_container = post.find('span', {'class':'pl'}).find('a')
    print info_container.text
    print info_container['href']

I always try to avoid hard-coding array indices in my code. Using the find function is also the most intuitive approach.
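
One caveat with chained find calls is that find returns None when nothing matches, so a listing without the expected span or link would raise an AttributeError. A hedged sketch of the same idea with explicit None checks, assuming Python 3 and the bs4 package; the sample markup and the name extract_postings are hypothetical, and the 'row' and 'pl' class names are taken from the answer above, not verified against the live site:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; class names follow the answer above.
SAMPLE_HTML = """
<p class="row"><a href="/img/1.html"></a><span class="pl"><a href="/rea/1.html">Sunny studio</a></span></p>
<p class="row"><span class="pl">Listing without a link</span></p>
<p class="footer"><a href="/about.html">About</a></p>
"""


def extract_postings(html):
    """Return (title, href) pairs for every row that has a titled link."""
    soup = BeautifulSoup(html, "html.parser")
    postings = []
    for post in soup.find_all("p", {"class": "row"}):
        title_span = post.find("span", {"class": "pl"})
        # find() returns None when there is no match, so check before chaining.
        link = title_span.find("a") if title_span is not None else None
        if link is None:
            continue
        postings.append((link.get_text(strip=True), link.get("href")))
    return postings


print(extract_postings(SAMPLE_HTML))
```

Selecting by tag and class keeps the code independent of how many other links happen to precede the title in each row.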

BeautifulSoup contents returns "list index out of range"
