I have an HTML snippet from which I need to extract data using BeautifulSoup:
<!doctype html>
<html lang="en">
<body>
<div class="sidebar-box">
<h3><i class="fa fa-users"></i> Management Team</h3>
Chairman, Director
</div>
<div class="sidebar-box">
<h3><i class="fa fa-male"></i> Teacher</h3>
John Doe
</div>
<div class="sidebar-box">
<h3><i class="fa fa-mortar-board"></i> Awards </h3>
National Top Quality Educational Development
</div>
<div class="sidebar-box">
<h3><i class="fa fa-building"></i> School Type</h3>
Secondary
</div>
</body>
</html>
I need the .text value "John Doe" from the second div from the top, not the .text value of the h3 tag inside that div.
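My problem is that I currently get both text values, as the following snippet shows:
[code snippet missing from the source]
This outputs:
[output missing from the source]
However, I only need the John Doe value.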
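Answer 0 (score: 4)
Here are 2 solutions. The first is not the most elegant, but off the top of my head: you can split the text again and then join everything that comes after "Teacher".
Option 1: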
html = '''
<!doctype html>
<html lang="en">
<body>
<div class="sidebar-box">
<h3><i class="fa fa-users"></i> Management Team</h3>
Chairman, Director
</div>
<div class="sidebar-box">
<h3><i class="fa fa-male"></i> Teacher</h3>
John Doe
</div>
<div class="sidebar-box">
<h3><i class="fa fa-mortar-board"></i> Awards </h3>
National Top Quality Educational Development
</div>
<div class="sidebar-box">
<h3><i class="fa fa-building"></i> School Type</h3>
Secondary
</div>
</body>
</html>'''
from bs4 import BeautifulSoup
soup4 = BeautifulSoup(html, "html.parser")
# Get School Head Teacher
school_head_teacher = soup4.find_all('div', {'class':'sidebar-box'})
school_head_teacher = school_head_teacher[1].text.strip()
school_head_teacher = school_head_teacher.split()[1:]
school_head_teacher = ' '.join(school_head_teacher)
print(school_head_teacher)
Output:
John Doe
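Option 2:
I think this one is a bit better. Find the text node that contains Teacher, then get its parent tag (the h3). Since you want the part that follows it, take .next_sibling and strip the surrounding whitespace.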
soup4(text=re.compile('Teacher'))[0].parent.next_sibling.strip()
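If there were several teachers, I would put this in a for loop. You can use the one-liner above instead of the for loop if you only need the first match.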
from bs4 import BeautifulSoup
import re
soup4 = BeautifulSoup(html, "html.parser")
# Get School Head Teacher
for elem in soup4(text=re.compile('Teacher')):
    print(elem.parent.next_sibling.strip())
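The same lookup can be written with the string= keyword, which newer BeautifulSoup 4 releases document as the current name for the text= argument. The following is only a minimal sketch under that assumption: the markup is a shortened copy of the question's HTML, and the names list and the None check are additions for illustration, not part of the answer above.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the question's markup (two of the four boxes).
html = '''<div class="sidebar-box">
<h3><i class="fa fa-users"></i> Management Team</h3>
Chairman, Director
</div>
<div class="sidebar-box">
<h3><i class="fa fa-male"></i> Teacher</h3>
John Doe
</div>'''

soup = BeautifulSoup(html, "html.parser")

names = []
# Each match is the text node inside an h3; its parent is the h3 itself.
for text_node in soup.find_all(string=re.compile('Teacher')):
    sibling = text_node.parent.next_sibling
    # Skip headings that have nothing after them.
    if sibling is not None:
        names.append(str(sibling).strip())

print(names)
```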
Answer 1 (score: 1)
Another option:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
teacher_name = soup.find_all('div', class_='sidebar-box')
print(teacher_name[1].contents[2].strip())
Output:
John Doe
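To see why index 2 is the right one, it can help to print the direct children of the div. This is a small sketch, assuming the html.parser backend and a shortened copy of the question's markup; the variable name box is only illustrative.

```python
from bs4 import BeautifulSoup

# Only the div of interest, copied from the question's markup.
html = '''<div class="sidebar-box">
<h3><i class="fa fa-male"></i> Teacher</h3>
John Doe
</div>'''

soup = BeautifulSoup(html, "html.parser")
box = soup.find('div', class_='sidebar-box')

# .contents holds the direct children of the div, text nodes included:
# a newline, the h3 tag, and the text that follows the h3.
for index, child in enumerate(box.contents):
    print(index, repr(str(child)))

teacher_name = box.contents[2].strip()
print(teacher_name)
```

The index depends on the whitespace in the markup, so this approach is likely to break if the layout of the div changes.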
Answer 2 (score: 1)
Given
<div class="sidebar-box">
<h3><i class="fa fa-male"></i> Teacher</h3>
John Doe
</div>
John Doe is the next sibling of <h3><i class="fa fa-male"></i> Teacher</h3>, so we can combine find_next() and next_sibling on <div class="sidebar-box">:
html = '''
<!doctype html>
<html lang="en">
<body>
<div class="sidebar-box">
<h3><i class="fa fa-users"></i> Management Team</h3>
Chairman, Director
</div>
<div class="sidebar-box">
<h3><i class="fa fa-male"></i> Teacher</h3>
John Doe
</div>
<div class="sidebar-box">
<h3><i class="fa fa-mortar-board"></i> Awards </h3>
National Top Quality Educational Development
</div>
<div class="sidebar-box">
<h3><i class="fa fa-building"></i> School Type</h3>
Secondary
</div>
</body>
</html>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# Get School Head Teacher
school_head_teacher = soup.find_all('div', {'class':'sidebar-box'})
head_teacher = school_head_teacher[1].find_next().next_sibling
print(head_teacher)
This way you can also loop over the other divs that follow the same pattern.
for school_info in school_head_teacher:
    print(school_info.find_next().next_sibling)
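Building on that loop, the heading of each box can be paired with the text that follows it. This is a rough sketch rather than part of the answer: it uses box.find('h3') instead of find_next() so the lookup stays inside each div, and the dictionary name info_by_heading is made up for illustration.

```python
from bs4 import BeautifulSoup

html = '''<div class="sidebar-box">
<h3><i class="fa fa-users"></i> Management Team</h3>
Chairman, Director
</div>
<div class="sidebar-box">
<h3><i class="fa fa-male"></i> Teacher</h3>
John Doe
</div>
<div class="sidebar-box">
<h3><i class="fa fa-mortar-board"></i> Awards </h3>
National Top Quality Educational Development
</div>
<div class="sidebar-box">
<h3><i class="fa fa-building"></i> School Type</h3>
Secondary
</div>'''

soup = BeautifulSoup(html, "html.parser")

info_by_heading = {}
for box in soup.find_all('div', class_='sidebar-box'):
    heading = box.find('h3')
    # Skip boxes without a heading or without anything after it.
    if heading is None or heading.next_sibling is None:
        continue
    label = heading.get_text(strip=True)
    info_by_heading[label] = str(heading.next_sibling).strip()

print(info_by_heading['Teacher'])
```

Looking values up by heading avoids relying on the position of the div, which matters if the order of the boxes on the page changes.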