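I found that BeautifulSoup.find() splits the class attribute on whitespace. Because of this, I can't use a regular expression in the code below. Can you help me find all the 'tree children' elements?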
import re
from bs4 import BeautifulSoup
r_html = "<div class='root'>" \
"<div class='tree children1'>text children 1 </div>" \
"<div class='tree children2'>text children 2 </div>" \
"<div class='tree children3'>text children 3 </div>" \
"</div>"
bs_tab = BeautifulSoup(r_html, "html.parser")
workspace_box_visible = bs_tab.findAll('div', {'class':'tree children1'})
print(workspace_box_visible)  # result: [<div class="tree children1">text children 1 </div>]
workspace_box_visible = bs_tab.findAll('div', {'class': re.compile(r'^tree children\d')})
print(workspace_box_visible)  # result: [] >>>> empty list because
# the class name was split on whitespace <<<<
# >>>>>> print all element classes <<<<<<<
def print_class(class_):
    print(class_)
    return False
workspace_box_visible = bs_tab.find('div', {'class': print_class})
# expected:
# root
# tree children1
# tree children2
# tree children3
# actual:
# root
# tree
# children1
# tree
# children2
# tree
# children3
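Thanks in advance,
====Comment==========
Stack Overflow does not allow comments longer than 500 characters, so I am adding my comment here:
The examples above show how BeautifulSoup matches the requested class.
But what if I have a DOM structure like this: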
r_html = "<div class='root'>" \
"<div class='tree children'>zero</div>" \
"<div class='tree children first'>first</div>" \
"<div class='tree children second'>second</div>" \
"<div class='tree children third'>third</div>" \
"</div>"
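When I need to select the elements whose class attribute is exactly 'tree children' or 'tree children first', none of the methods described in your (Padraic Cunningham's) post work.
I found a solution using a regular expression: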
controls = bs_tab.findAll('div')
for control in controls:
    # Join the class list back into one string and anchor both ends, so that
    # only 'tree children' and 'tree children first' match.
    if re.search(r"^tree children( first)?$", " ".join(control.get('class', []))):
        print(control)
And another solution:
bs_tab.findAll('div', class_='tree children') + bs_tab.findAll('div', class_='tree children first')
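I know this is not a good approach. I hope the BeautifulSoup module has a proper method for this.
Answer 0 (score: 3)
There are a few different ways, depending on the structure of the HTML. These are CSS classes, so you can just use class_=..
or a CSS selector with .select: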
In [3]: bs_tab.find_all('div', class_="tree")
Out[3]:
[<div class="tree children1">text children 1 </div>,
<div class="tree children2">text children 2 </div>,
<div class="tree children3">text children 3 </div>]
In [4]: bs_tab.select("div.tree")
Out[4]:
[<div class="tree children1">text children 1 </div>,
<div class="tree children2">text children 2 </div>,
<div class="tree children3">text children 3 </div>]
But that would also find any other element with the tree class if you had one elsewhere.
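If stray tree elements elsewhere in the page are a concern, one option is to anchor the selector to the parent. The sketch below assumes the wanted divs are direct children of div.root, as in the question's markup; the extra div outside the root is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the question's structure plus one stray "tree" div
# outside the root, added only to show the difference.
html = (
    "<div class='root'>"
    "<div class='tree children1'>text children 1 </div>"
    "<div class='tree children2'>text children 2 </div>"
    "<div class='tree children3'>text children 3 </div>"
    "</div>"
    "<div class='tree'>unrelated</div>"
)
soup = BeautifulSoup(html, "html.parser")

everywhere = soup.select("div.tree")         # also picks up the stray div
scoped = soup.select("div.root > div.tree")  # direct children of .root only

print(len(everywhere), len(scoped))  # 4 3
```

The child combinator (>) only helps when the nesting is known in advance; if the wanted divs can sit deeper, a descendant selector such as "div.root div.tree" may be a better fit.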
You could use a selector to find the divs whose class contains children:
In [5]: bs_tab.select("div[class*=children]")
Out[5]:
[<div class="tree children1">text children 1 </div>,
<div class="tree children2">text children 2 </div>,
<div class="tree children3">text children 3 </div>]
But that would also select any other tags whose class name contains children.
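A tighter attribute selector may help with that. CSS attribute selectors are compared against the whole class string rather than the individual class names, so ^= can require that the value starts with tree children. This is a sketch that assumes a reasonably recent bs4 (one where .select is backed by soupsieve); the other children9 div is an invented distractor.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two wanted divs plus one distractor whose class
# also contains "children".
html = (
    "<div class='root'>"
    "<div class='tree children1'>text children 1 </div>"
    "<div class='tree children2'>text children 2 </div>"
    "<div class='other children9'>distractor</div>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

loose = soup.select("div[class*=children]")          # substring match
strict = soup.select('div[class^="tree children"]')  # prefix match

print([t["class"] for t in loose])
print([t["class"] for t in strict])
```

The prefix match depends on the order in which the classes are written in the markup, so it would miss an element written as class='children1 tree'.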
You could be a bit more specific with a regex and look for children followed by one or more digits:
In [6]: bs_tab.find_all('div', class_=re.compile(r"children\d+"))
Out[6]:
[<div class="tree children1">text children 1 </div>,
<div class="tree children2">text children 2 </div>,
<div class="tree children3">text children 3 </div>]
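Keep in mind that a compiled pattern passed to class_ is applied with its search() method, so an unanchored children\d+ would also hit a class such as grandchildren12. Anchoring both ends is a cheap guard. In this sketch the leaf grandchildren12 div is a made-up distractor.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: two wanted divs plus one distractor whose class
# merely ends with "children" followed by digits.
html = (
    "<div class='tree children1'>one</div>"
    "<div class='tree children2'>two</div>"
    "<div class='leaf grandchildren12'>distractor</div>"
)
soup = BeautifulSoup(html, "html.parser")

unanchored = soup.find_all("div", class_=re.compile(r"children\d+"))
anchored = soup.find_all("div", class_=re.compile(r"^children\d+$"))

print(len(unanchored), len(anchored))  # 3 2
```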
Or find all the div.tree elements and check whether the last name in tag["class"] startswith children:
In [7]: [t for t in bs_tab.select("div.tree") if t["class"][-1].startswith("children")]
Out[7]:
[<div class="tree children1">text children 1 </div>,
<div class="tree children2">text children 2 </div>,
<div class="tree children3">text children 3 </div>]
Or look for children and check whether the first CSS class name is equal to tree:
In [8]: [t for t in bs_tab.select("div[class*=children]") if t["class"][0] == "tree"]
Out[8]:
[<div class="tree children1">text children 1 </div>,
<div class="tree children2">text children 2 </div>,
<div class="tree children3">text children 3 </div>]
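For the follow-up case in the question's comment (only the divs whose class is exactly tree children or tree children first), one possibility is to pass a function to find_all and compare the full class list. This is a sketch rather than a built-in exact-match feature; the names WANTED and has_exact_classes are made up here.

```python
from bs4 import BeautifulSoup

html = (
    "<div class='root'>"
    "<div class='tree children'>zero</div>"
    "<div class='tree children first'>first</div>"
    "<div class='tree children second'>second</div>"
    "<div class='tree children third'>third</div>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# Hypothetical constant: the class lists that count as a match.
WANTED = (["tree", "children"], ["tree", "children", "first"])

def has_exact_classes(tag):
    # tag.get("class") is a list of class names, or None when the
    # attribute is absent, so this compares the complete list.
    return tag.name == "div" and tag.get("class") in WANTED

matches = soup.find_all(has_exact_classes)
print([t.get_text() for t in matches])  # ['zero', 'first']
```

Comparing lists is order-sensitive, so class='children tree' would not match; comparing sets of class names instead would relax that if the order can vary.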