findAll() in BeautifulSoup skips multiple IDs

Time: 2018-05-17 19:10:51

Tags: python beautifulsoup html-parsing

I have an image tag in a string that contains more than one id attribute:

<img id="webfast-uhyubv" alt="" data-type="image" id="comp-jefxldtzbalatamediacontentimage" src="http://webfast.co/images/webfast-logo.png" /> 

import bs4

# webpage holds the HTML source containing the <img> tag above
soup = bs4.BeautifulSoup(webpage, "html.parser")
images = soup.findAll('img')
for image in images:
    print(image)

The code above returns only id=comp-jefxldtzbalatamediacontentimage

Replacing

soup = bs4.BeautifulSoup(webpage,"html.parser")

with

soup = bs4.BeautifulSoup(webpage,"lxml")

returns the first id, webfast-uhyubv, instead.

However, I want to get the ids in the order in which they appear in the input line.

1 Answer:

Answer 0: (score: 1)

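BeautifulSoup stores the attributes of a tag in a dictionary. Since a dictionary cannot have duplicate keys, one id attribute overwrites the other. You can inspect the attribute dictionary with tag.attrs.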

>>> from bs4 import BeautifulSoup
>>> # tag is the <img ...> string from the question
>>> soup = BeautifulSoup(tag, 'html.parser')
>>> soup.img.attrs
{'id': 'comp-jefxldtzbalatamediacontentimage', 'alt': '', 'data-type': 'image', 'src': 'http://webfast.co/images/webfast-logo.png'}

>>> soup = BeautifulSoup(tag, 'lxml')
>>> soup.img.attrs
{'id': 'webfast-uhyubv', 'alt': '', 'data-type': 'image', 'src': 'http://webfast.co/images/webfast-logo.png'}

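As you can see, we get a different value for id depending on the parser. This happens because different parsers work differently: here html.parser ends up with the last id, while lxml ends up with the first.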
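To make the overwrite behaviour concrete, here is a minimal sketch using plain Python dictionaries. It is not BeautifulSoup's or lxml's internal code; it only illustrates how the same attribute pairs can produce either the last or the first id, depending on how the dictionary is filled. The variable names are made up for illustration.

```python
# Attribute pairs in the order they appear in the <img> tag from the question.
pairs = [
    ("id", "webfast-uhyubv"),
    ("alt", ""),
    ("data-type", "image"),
    ("id", "comp-jefxldtzbalatamediacontentimage"),
]

# Last value wins: each assignment overwrites any earlier value for the key.
last_wins = {}
for name, value in pairs:
    last_wins[name] = value

# First value wins: setdefault() stores a value only if the key is absent.
first_wins = {}
for name, value in pairs:
    first_wins.setdefault(name, value)

print(last_wins["id"])
print(first_wins["id"])
```

Either way, only one id survives, which is why the order of both values cannot be recovered from tag.attrs.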

You cannot get both id values using BeautifulSoup. You can use a regular expression to get them. However, use it carefully and as a last resort!

>>> import re
>>> tag = '<img id="webfast-uhyubv" alt="" data-type="image" id="comp-jefxldtzbalatamediacontentimage" src="http://webfast.co/images/webfast-logo.png" />'
>>> ids = re.findall(r'id="(.*?)"', tag)
>>> ids
['webfast-uhyubv', 'comp-jefxldtzbalatamediacontentimage']
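
If a regular expression feels too fragile, one possible alternative is Python's standard-library html.parser.HTMLParser. It passes a tag's attributes to handle_starttag as a list of (name, value) pairs rather than a dictionary, so duplicate attributes survive in source order. The following is a minimal sketch, assuming Python 3; the IdCollector class name is hypothetical and not part of any library.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Hypothetical helper: collects every id attribute value in document order."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so duplicate names are kept.
        for name, value in attrs:
            if name == "id":
                self.ids.append(value)


tag = '<img id="webfast-uhyubv" alt="" data-type="image" id="comp-jefxldtzbalatamediacontentimage" src="http://webfast.co/images/webfast-logo.png" />'

collector = IdCollector()
collector.feed(tag)
collector.close()
print(collector.ids)
```

Unlike the regular expression, this only looks at attributes literally named id, so something like data-id="..." would not be picked up by accident.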