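How to find specific tags with BeautifulSoup

Time: 2015-01-07 06:17:53

Tags: python html beautifulsoup html-parsing

I have the source HTML here: http://pastebin.com/rxK0mnVj. For each img tag, I want to check that it has a data-blzsrc attribute and whether its src contains a data URI, then return True or False.

For example,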

<img src="data:image/gif;base64,R0lGODlhAQABAID/AMDAwAAAACH5BAEAAAAALAAAAAABAAEAQAICRAEAOw==" data-blzsrc="http://1.resources.newtest.strawberrynet.com.edgesuite.net/4/C/lyiTlubX4.webp" width="324" height="60" alt="StrawberryNET" /></a>

should return False, because the data-blzsrc attribute is present but the src attribute contains data:

However,

<img src="http://images.akam.net/img1.jpg" data-blzsrc="http://1.resources.newtest.strawberrynet.com.edgesuite.net/4/C/lyiTlubX4.webp" width="324" height="60" alt="StrawberryNET" /></a>

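should return True, because it has the data-blzsrc attribute and its src does not contain data:

How can I achieve this with BeautifulSoup?

2 Answers:

Answer 0: (score: 1)

If you want to find all img tags and test each one, use find_all() and check the attributes, for example: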

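from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed.
with open('index.html') as f:
    soup = BeautifulSoup(f, 'html.parser')

def check_img(img):
    # True only when data-blzsrc is present and src is not a data: URI.
    return 'data-blzsrc' in img.attrs and not img.get('src', '').startswith('data:')

for img in soup.find_all('img'):
    print(img, check_img(img))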

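If you only want the images that meet the condition, pass find_all() an attrs dictionary. Set data-blzsrc to True to require that the attribute be present, and use a function to check that the src value is not a data: URI. Note that an img with no src at all is skipped by this filter, because the function receives None for it:

for img in soup.find_all('img', attrs={'data-blzsrc': True, 'src': lambda x: x and not x.startswith('data:')}):
    print(img)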
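The same filter can probably also be written as a CSS selector with select(), assuming a BeautifulSoup release recent enough to support attribute selectors and :not(). The sketch below is only an illustration: the markup is a simplified stand-in for the question's HTML, and the id values and example.com URLs are made up so the result is easy to read.

```python
from bs4 import BeautifulSoup

# Simplified stand-in markup; the id attributes and example.com URLs are hypothetical.
html = """
<img id="placeholder" src="data:image/gif;base64,R0lGODlhAQABAID/AMDAwAAAACH5BAEAAAAALAAAAAABAAEAQAICRAEAOw==" data-blzsrc="http://example.com/a.webp">
<img id="real" src="http://images.akam.net/img1.jpg" data-blzsrc="http://example.com/a.webp">
<img id="plain" src="http://images.akam.net/img2.jpg">
"""

soup = BeautifulSoup(html, "html.parser")

# [data-blzsrc] requires the attribute to exist;
# :not([src^="data:"]) rejects any src that starts with "data:".
matches = soup.select('img[data-blzsrc]:not([src^="data:"])')
ids = [img["id"] for img in matches]
print(ids)
```

Unlike the attrs filter, this selector would also match an img that has data-blzsrc but no src attribute, so pick whichever behaviour suits your pages.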

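Answer 1: (score: 0)

Try finding all the images, then check that the required attribute exists and inspect the content of the src attribute. Take a look at this script: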

from bs4 import BeautifulSoup
html = """
<img src="data:image/gif;base64,R0lGODlhAQABAID/AMDAwAAAACH5BAEAAAAALAAAAAABAAEAQAICRAEAOw==" data-blzsrc="http://1.resources.newtest.strawberrynet.com.edgesuite.net/4/C/lyiTlubX4.webp" width="324" height="60" alt="StrawberryNET" /></a>
<img src="http://images.akam.net/img1.jpg" data-blzsrc="http://1.resources.newtest.strawberrynet.com.edgesuite.net/4/C/lyiTlubX4.webp" width="324" height="60" alt="StrawberryNET" /></a>
"""

soup = BeautifulSoup(html, 'html.parser')
for img in soup.find_all('img'):
    # Keep the tag only if data-blzsrc exists and src is not a data: URI.
    if img.has_attr('data-blzsrc') and not img.attrs.get('src', '').startswith('data:'):
        print(img)

It prints the desired img node:

<img alt="StrawberryNET" data-blzsrc="http://1.resources.newtest.strawberrynet.com.edgesuite.net/4/C/lyiTlubX4.webp" height="60" src="http://images.akam.net/img1.jpg" width="324"/>
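To get the True/False result the question asks for, the same condition can be wrapped in a small helper. This is a minimal sketch, not part of either answer: the function name has_real_src and the example.com URLs are hypothetical. It also shows why testing for the "data:" prefix is safer than testing for the substring "data", which would wrongly reject an ordinary URL whose path happens to contain that word.

```python
from bs4 import BeautifulSoup

def has_real_src(tag_html):
    """Return True if the first <img> has data-blzsrc and a src that is not a data: URI.

    has_real_src is a hypothetical helper name used only for this illustration.
    """
    img = BeautifulSoup(tag_html, "html.parser").find("img")
    if img is None:
        return False
    src = img.get("src", "")
    return img.has_attr("data-blzsrc") and not src.startswith("data:")

# Placeholder image: data-blzsrc is present, but src is a data: URI.
lazy = has_real_src('<img src="data:image/gif;base64,R0lGODlh" data-blzsrc="http://example.com/a.webp">')

# Real image: data-blzsrc is present and src is an ordinary URL.
real = has_real_src('<img src="http://images.akam.net/img1.jpg" data-blzsrc="http://example.com/a.webp">')

# Ordinary URL with "data" in its path: a substring test would reject it by mistake.
tricky = has_real_src('<img src="http://example.com/data/img1.jpg" data-blzsrc="http://example.com/a.webp">')

print(lazy, real, tricky)
```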