Python - BeautifulSoup - How to check whether a ResultSet contains an element

Asked: 2013-09-05 20:12:36

Tags: python beautifulsoup

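I'm doing some web scraping and have run into something I can't figure out. Basically, I need to check whether the first element (index 0) of my ResultSet, releaseDate, has a 'content' attribute, as in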

[<meta content="1992-09-11" itemprop="datePublished"/>]

But when 'content' is not in the tag, I get an error like this:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "imdbQuestion.py", line 18, in <module>
if releaseDate[0]['content']:
File "build/bdist.macosx-10.8-intel/egg/bs4/element.py", line 879, in __getitem__
KeyError: 'content'

How can I check whether 'content' is in releaseDate[0] without raising an error?

Also, how do I extract whatever I want from a ResultSet object?

The full code is:

import codecs

import requests
from bs4 import BeautifulSoup

file = codecs.open('imdb.txt', 'w', encoding = 'utf-8')

#iterate through last value
for increment in range(7,10):
    imdbNum = '015008' + str(increment)
    url = 'http://www.imdb.com/title/tt' + imdbNum

    urlCode = requests.get(url)
    soup = BeautifulSoup(urlCode.content)

    #get release date
    releaseDate = soup.findAll(attrs={'itemprop':'datePublished'})
    abc = releaseDate
    #error checking - assign '.' to releaseDate if the ResultSet is empty
    #if not empty, check if 'content' is in releaseDate[0]. if so, we are good. if not, assign 'CHECK' to releaseDate
    if releaseDate:
        if releaseDate[0]['content']:
            releaseDate = releaseDate[0]['content']
        else:
            releaseDate = 'CHECK'
    else:
        releaseDate = '.'

    print(releaseDate)

#close the file once, after the loop has finished
file.close()

1 answer:

Answer 0 (score: 3)

Test against the Tag.attrs dictionary:

if releaseDate:
    if 'content' in releaseDate[0].attrs:
        releaseDate = releaseDate[0]['content']
    else:
        releaseDate = 'CHECK'

Or use the dict.get() method on that attribute dictionary, which returns a default instead of raising KeyError:

if releaseDate:
    releaseDate = releaseDate[0].attrs.get('content', 'CHECK')
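
To try the check without hitting the network, here is a minimal sketch that runs it against small inline HTML strings. The sample markup and the helper name extract_release_date are made up for illustration, and passing 'html.parser' explicitly is an assumption about which parser you want to use.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one tag with 'content', one without.
HTML_WITH = '<meta content="1992-09-11" itemprop="datePublished"/>'
HTML_WITHOUT = '<span itemprop="datePublished">11 September 1992</span>'


def extract_release_date(html):
    """Return the 'content' value, 'CHECK' if the attribute is
    missing, or '.' if no tag matched at all."""
    soup = BeautifulSoup(html, 'html.parser')
    matches = soup.find_all(attrs={'itemprop': 'datePublished'})
    if not matches:
        return '.'
    # dict.get() on Tag.attrs returns the default instead of raising KeyError
    return matches[0].attrs.get('content', 'CHECK')


print(extract_release_date(HTML_WITH))
print(extract_release_date(HTML_WITHOUT))
print(extract_release_date('<p>no date here</p>'))
```

The three calls cover the three branches of the original code: attribute present, attribute missing, and empty ResultSet.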

Quick demo:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> imdbNum = '0150087'
>>> url = 'http://www.imdb.com/title/tt' + imdbNum
>>> urlCode = requests.get(url)
>>> soup = BeautifulSoup(urlCode.content)
>>> releaseDate = soup.findAll(attrs={'itemprop':'datePublished'})
>>> releaseDate[0]
<meta content="1966-04" itemprop="datePublished"/>
>>> releaseDate[0].attrs
{'content': '1966-04', 'itemprop': 'datePublished'}
>>> 'content' in releaseDate[0].attrs
True
>>> releaseDate[0].attrs.get('content', 'CHECK')
'1966-04'
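
As for pulling values out of a ResultSet in general: it is a list subclass, so it can be indexed, measured with len() and iterated like any other list. The sketch below is only an illustration on made-up markup (the HTML string and the variable names are assumptions, not part of the original code). It also shows that passing True as an attribute value restricts the search to tags that actually define that attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
HTML = '''
<meta content="1966-04" itemprop="datePublished"/>
<span itemprop="datePublished">April 1966</span>
<meta content="1992-09-11" itemprop="datePublished"/>
'''

soup = BeautifulSoup(HTML, 'html.parser')
results = soup.find_all(attrs={'itemprop': 'datePublished'})

# A ResultSet supports len(), indexing and iteration like a list.
# Tag.get() returns the default when the attribute is absent.
dates = [tag.get('content', 'CHECK') for tag in results]

# An attribute value of True matches only tags that define that attribute,
# so indexing with ['content'] is safe on every tag in this ResultSet.
with_content = soup.find_all(attrs={'itemprop': 'datePublished', 'content': True})
strict_dates = [tag['content'] for tag in with_content]

print(dates)
print(strict_dates)
```

Filtering in the search itself keeps the later code free of per-tag checks, at the cost of silently skipping tags that lack the attribute.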