BeautifulSoup: find_all div tags with different class names

Time: 2018-07-30 07:27:12

Tags: python-3.x beautifulsoup

I want to select all <div> tags whose class is either post has-profile bg2 or post has-profile bg1, but not the last one, i.e. panel.

<div id="6" class="post has-profile bg2"> some text 1 </div>
<div id="7" class="post has-profile bg1"> some text 2 </div>
<div id="8" class="post has-profile bg2"> some text 3 </div>
<div id="9" class="post has-profile bg1"> some text 4 </div>

<div class="panel bg1" id="abc"> ... </div>

select() only matches a single occurrence. I am trying find_all() instead, but bs4 is not able to find the tags.

if soup.find(class_ = re.compile(r"post has-profile [bg1|bg2]")):
    posts = soup.find_all(class_ = re.compile(r"post has-profile [bg1|bg2]"))

How can I solve this both with and without a regular expression? Thanks.

3 answers:

Answer 0 (score: 1)

You can use the CSS selectors built into BeautifulSoup:

data = """<div id="6" class="post has-profile bg2"> some text 1 </div>
<div id="7" class="post has-profile bg1"> some text 2 </div>
<div id="8" class="post has-profile bg2"> some text 3 </div>
<div id="9" class="post has-profile bg1"> some text 4 </div>
<div class="panel bg1" id="abc"> ... </div>"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

divs = soup.select('div.post.has-profile.bg2, div.post.has-profile.bg1')

for div in divs:
    print(div)
    print('-' * 80)

Prints:

<div class="post has-profile bg2" id="6"> some text 1 </div>
--------------------------------------------------------------------------------
<div class="post has-profile bg2" id="8"> some text 3 </div>
--------------------------------------------------------------------------------
<div class="post has-profile bg1" id="7"> some text 2 </div>
--------------------------------------------------------------------------------
<div class="post has-profile bg1" id="9"> some text 4 </div>
--------------------------------------------------------------------------------

The selector 'div.post.has-profile.bg2, div.post.has-profile.bg1' selects all <div> tags with class "post has-profile bg2" and all <div> tags with class "post has-profile bg1".
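One property of class selectors worth noting: they test whether each class is present, so the order of the names inside the class attribute should not matter. The sketch below is only an illustration; the shortened, reordered markup and the use of html.parser instead of lxml are assumptions made for the example, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second div lists the same classes in a different order.
html = """<div id="6" class="post has-profile bg2"> some text 1 </div>
<div id="7" class="bg1 post has-profile"> some text 2 </div>
<div class="panel bg1" id="abc"> ... </div>"""

soup = BeautifulSoup(html, "html.parser")

# Same selector as in the answer: each part requires all three classes.
divs = soup.select("div.post.has-profile.bg2, div.post.has-profile.bg1")

# Sort the ids so the result does not depend on the order select() returns.
ids = sorted(div["id"] for div in divs)
print(ids)
```

The panel div carries bg1 but neither post nor has-profile, so neither part of the selector picks it up.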

Answer 1 (score: 1)

You can define a function that describes the tags of interest:

def test_tag(tag):
    return tag.name=='div' \
       and tag.has_attr('class') \
       and "post" in tag['class'] \
       and "has-profile" in tag['class'] \
       and ("bg1" in tag['class'] or "bg2" in tag['class']) \
       and "panel" not in tag['class']

And apply that function to the "soup":

soup.find_all(test_tag)
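The same filter can also be sketched with set operations, which some readers may find easier to extend. This is a hedged variant, not the answerer's code: the names is_post_div, REQUIRED and BACKGROUNDS are made up for the example, and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup

html = """<div id="6" class="post has-profile bg2"> some text 1 </div>
<div id="7" class="post has-profile bg1"> some text 2 </div>
<div id="8" class="post has-profile bg2"> some text 3 </div>
<div id="9" class="post has-profile bg1"> some text 4 </div>
<div class="panel bg1" id="abc"> ... </div>"""

# Hypothetical helper names, introduced only for this sketch.
REQUIRED = {"post", "has-profile"}   # classes that must all be present
BACKGROUNDS = {"bg1", "bg2"}         # at least one of these must be present


def is_post_div(tag):
    # Tags without a class attribute yield an empty set and are rejected.
    classes = set(tag.get("class", []))
    return (
        tag.name == "div"
        and REQUIRED <= classes
        and bool(BACKGROUNDS & classes)
        and "panel" not in classes
    )


soup = BeautifulSoup(html, "html.parser")

# find_all() accepts a function and keeps the tags for which it returns True.
post_ids = [div["id"] for div in soup.find_all(is_post_div)]
print(post_ids)
```

Because the check works on a set of class names, it does not depend on the order in which the classes are written in the markup.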

Answer 2 (score: 0)

Use a regular expression.

Try:

from bs4 import BeautifulSoup
import re
s = """<div id="6" class="post has-profile bg2"> some text 1 </div>
<div id="7" class="post has-profile bg1"> some text 2 </div>
<div id="8" class="post has-profile bg2"> some text 3 </div>
<div id="9" class="post has-profile bg1"> some text 4 </div>

<div class="panel bg1" id="abc"> ... </div>"""

soup = BeautifulSoup(s, "html.parser")
for i in soup.find_all("div", class_=re.compile(r"post has-profile bg(1|2)")):
    print(i)

Output:

<div class="post has-profile bg2" id="6"> some text 1 </div>
<div class="post has-profile bg1" id="7"> some text 2 </div>
<div class="post has-profile bg2" id="8"> some text 3 </div>
<div class="post has-profile bg1" id="9"> some text 4 </div>
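
A note on the two patterns: the question uses [bg1|bg2], which is a character class that matches a single character (b, g, 1, | or 2), whereas bg(1|2) is a group with alternation. The sketch below compares them using only the re module; the sample string is hypothetical and chosen just to show the difference.

```python
import re

# Pattern from the question: [bg1|bg2] matches exactly ONE character
# out of b, g, 1, | and 2. It is not an alternation between "bg1" and "bg2".
char_class = re.compile(r"post has-profile [bg1|bg2]")

# Pattern from the answer: (1|2) is a group, so "bg" must be followed by 1 or 2.
group = re.compile(r"post has-profile bg(1|2)")

# Hypothetical class string, not taken from the original page.
sample = "post has-profile gallery"

loose_match = char_class.search(sample) is not None   # matches on the "g"
strict_match = group.search(sample) is not None       # no "bg1" or "bg2" here
print(loose_match, strict_match)
```

Both patterns spell out the class names in a fixed order, so they rely on the attribute being written as post has-profile bg1 or post has-profile bg2; the CSS selector and the filter function do not have that restriction.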