Selecting specific tags with BeautifulSoup

Asked: 2012-02-12 23:21:56

Tags: python beautifulsoup

I use this code to fetch some HTML table rows with BeautifulSoup:

from bs4 import BeautifulSoup
import urllib2
import re

page = urllib2.urlopen('www.something.bla')
soup = BeautifulSoup(page)
rows = soup.findAll('tr', attrs={'class': re.compile('class1.*')})

This is the result I get:

<tr class="class1 class2 class3">...</tr>
<tr class="class1 class2 class3">...</tr>
<tr class="class1 class5">...</tr>
<tr class="class1_a class5_a">...</tr>
<tr class="class1 class5">...</tr>
<tr class="class1_a class5_a">...</tr>
<!-- etc. -->

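However, I want to exclude the rows whose class attribute is class1 class2 class3 (or avoid selecting them in the first place).

How can I do that?
Thanks for your help!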

1 answer:

Answer 0 (score: 9)

It may be easier without a regular expression. This works with BeautifulSoup 3:

from BeautifulSoup import BeautifulSoup

page = """
<tr class="class1 class2 class3">1</tr>
<tr class="class1 class2 class3">2</tr>
<tr class="class1 class5">3</tr>
<tr class="class1_a class5_a">4</tr>
<tr class="class1 class5">5</tr>
<tr class="class1_a class5_a">6</tr>
<tr>7</tr>"""

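def cond(x):
    # x is the value of the class attribute; it is empty or missing
    # for tags that have no class, hence the truthiness check.
    if x:
        return x.startswith("class1") and "class2 class3" not in x
    else:
        return False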

soup = BeautifulSoup(page)
rows = soup.findAll('tr', {'class': cond})

for row in rows:
    print row

=>

<tr class="class1 class5">3</tr>
<tr class="class1_a class5_a">4</tr>
<tr class="class1 class5">5</tr>
<tr class="class1_a class5_a">6</tr>

With BeautifulSoup 4, I was able to make it work as follows:

import re
from bs4 import BeautifulSoup

page = """
<tr class="class1 class2 class3">1</tr>
<tr class="class1 class2 class3">2</tr>
<tr class="class1 class5">3</tr>
<tr class="class1_a class5_a">4</tr>
<tr class="class1 class5">5</tr>
<tr class="class1_a class5_a">6</tr>
<tr>7</tr>"""

soup = BeautifulSoup(page)
rows = soup.find_all('tr', {'class': re.compile('class1.*')})

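for row in rows:
    # In BS4 the class attribute is a list of strings,
    # so test membership for each class name separately.
    cls = row.attrs.get("class")
    if not ("class2" in cls or "class3" in cls):
        print(row)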

=>

<tr class="class1 class5">3</tr>
<tr class="class1_a class5_a">4</tr>
<tr class="class1 class5">5</tr>
<tr class="class1_a class5_a">6</tr>

In BS4, multi-valued attributes such as class have a list of strings as their value rather than a single string. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#id12
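
Because class comes back as a list in BS4, another option is to pass find_all a function that receives the whole tag and does the filtering in one step. The following is only a sketch under a few assumptions: the function name wanted, the shortened sample table, and the choice of the built-in html.parser are mine and are not part of the code above.

```python
from bs4 import BeautifulSoup

# Shortened sample data; the <table> wrapper and <td> cells are added for illustration.
page = """
<table>
<tr class="class1 class2 class3"><td>1</td></tr>
<tr class="class1 class5"><td>3</td></tr>
<tr class="class1_a class5_a"><td>4</td></tr>
<tr><td>7</td></tr>
</table>"""

soup = BeautifulSoup(page, "html.parser")

# The class attribute is a list of strings, not one space-separated string.
first_classes = soup.find("tr")["class"]
print(first_classes)  # ['class1', 'class2', 'class3']

def wanted(tag):
    # Hypothetical helper: keep <tr> tags that have a class starting with
    # "class1" and that carry neither "class2" nor "class3".
    if tag.name != "tr":
        return False
    classes = tag.get("class", [])  # [] when the tag has no class attribute
    has_class1 = any(c.startswith("class1") for c in classes)
    excluded = "class2" in classes or "class3" in classes
    return has_class1 and not excluded

rows = soup.find_all(wanted)
kept = [row.get_text() for row in rows]
print(kept)  # ['3', '4']
```

Like the BS4 loop above, this drops a row if it has either class2 or class3, which is slightly broader than excluding only the exact combination class1 class2 class3; adjust the excluded test if that matters for your data.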