I have an image tag in a string that contains more than one id attribute:
<img id="webfast-uhyubv" alt="" data-type="image" id="comp-jefxldtzbalatamediacontentimage" src="http://webfast.co/images/webfast-logo.png" />
import bs4

soup = bs4.BeautifulSoup(webpage, "html.parser")
images = soup.find_all('img')
for image in images:
    print(image)
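The code above returns only id=comp-jefxldtzbalatamediacontentimage.
Replacing
soup = bs4.BeautifulSoup(webpage,"html.parser")
with
soup = bs4.BeautifulSoup(webpage,"lxml")
returns the first id, webfast-uhyubv.
However, I want to get the ids in the order in which they appear in the input line.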
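Answer 0 (score: 1)
BeautifulSoup stores the attributes of a tag in a dictionary. Since a dictionary cannot have duplicate keys, one id attribute overwrites the other. You can inspect the attribute dictionary with tag.attrs.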
>>> soup = BeautifulSoup(tag, 'html.parser')
>>> soup.img.attrs
{'id': 'comp-jefxldtzbalatamediacontentimage', 'alt': '', 'data-type': 'image', 'src': 'http://webfast.co/images/webfast-logo.png'}
>>> soup = BeautifulSoup(tag, 'lxml')
>>> soup.img.attrs
{'id': 'webfast-uhyubv', 'alt': '', 'data-type': 'image', 'src': 'http://webfast.co/images/webfast-logo.png'}
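As you can see, we get a different value for id depending on the parser. This happens because different parsers work differently: here html.parser keeps the last id, while lxml keeps the first.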
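The two outcomes above match two common ways of folding a list of attribute pairs into a dictionary. The snippet below is only an illustration of those two policies in plain Python, not the parsers' actual internals; the names pairs, last_wins and first_wins are made up for this example.

```python
# Attribute pairs in the order they appear in the tag (illustrative data).
pairs = [
    ("id", "webfast-uhyubv"),
    ("alt", ""),
    ("data-type", "image"),
    ("id", "comp-jefxldtzbalatamediacontentimage"),
]

# Policy 1: later duplicates overwrite earlier ones (what html.parser showed above).
last_wins = dict(pairs)

# Policy 2: the first value is kept and later duplicates are ignored
# (what lxml showed above).
first_wins = {}
for name, value in pairs:
    first_wins.setdefault(name, value)

print(last_wins["id"])
print(first_wins["id"])
```

Either way, a plain dictionary ends up with a single id, which is why the order of both values cannot be recovered from tag.attrs.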
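Because of this, the attribute dictionary BeautifulSoup gives you cannot hold both id values at once. You can get them with a regex instead. However, use it carefully and as a last resort!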
>>> import re
>>> tag = '<img id="webfast-uhyubv" alt="" data-type="image" id="comp-jefxldtzbalatamediacontentimage" src="http://webfast.co/images/webfast-logo.png" />'
>>> ids = re.findall('id="(.*?)"', tag)
>>> ids
['webfast-uhyubv', 'comp-jefxldtzbalatamediacontentimage']
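If you would rather not use a regex, one possible alternative (not part of the original answer) is Python's standard-library html.parser.HTMLParser, which passes attributes to handle_starttag as a list of (name, value) pairs, so duplicates and their order are preserved. A minimal sketch, assuming Python 3; the class name IdCollector is hypothetical:

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collects every id value found on img tags, in source order."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples, so a repeated
        # attribute name shows up once per occurrence.
        if tag == "img":
            self.ids.extend(value for name, value in attrs if name == "id")


tag = '<img id="webfast-uhyubv" alt="" data-type="image" id="comp-jefxldtzbalatamediacontentimage" src="http://webfast.co/images/webfast-logo.png" />'

collector = IdCollector()
collector.feed(tag)
collector.close()
ids = collector.ids
print(ids)
```

Unlike the regex, this only matches attributes whose name is exactly id, so something like data-id would not be picked up by accident.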