Extract text data from a div tag, but not from its child h3 tag

Asked: 2019-02-15 10:25:44

Tags: python-3.x web-scraping beautifulsoup

I have an HTML snippet from which I need to extract data using BeautifulSoup:

<!doctype html>
<html lang="en">
    <body>
        <div class="sidebar-box">
            <h3><i class="fa fa-users"></i> Management Team</h3>
                        Chairman, Director
        </div>
        <div class="sidebar-box">
            <h3><i class="fa fa-male"></i> Teacher</h3>
                        John Doe
        </div>
        <div class="sidebar-box">
            <h3><i class="fa fa-mortar-board"></i> Awards </h3>
                        National Top Quality Educational Development
        </div>
        <div class="sidebar-box">
            <h3><i class="fa fa-building"></i> School Type</h3>
                        Secondary
        </div>
    </body>
</html>

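I need to get the .text value "John Doe" from the second div from the top, not the .text value of the h3 tag inside that div. My problem is that reading the div's .text currently returns both text values together (the h3 heading "Teacher" and the name).

However, I only need the John Doe value.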
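The behaviour described in the question can be reproduced with a minimal sketch. The markup below is a trimmed copy of the question's HTML (two of the four boxes), and the variable names `boxes` and `combined` are illustrative assumptions, not the asker's original code.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (only two of the four boxes).
html = """
<div class="sidebar-box">
    <h3><i class="fa fa-users"></i> Management Team</h3>
                Chairman, Director
</div>
<div class="sidebar-box">
    <h3><i class="fa fa-male"></i> Teacher</h3>
                John Doe
</div>
"""

soup = BeautifulSoup(html, "html.parser")
boxes = soup.find_all("div", class_="sidebar-box")

# .text concatenates every text node under the div, including the h3 heading,
# so the heading and the name come back as one string.
combined = boxes[1].text.strip()
print(repr(combined))
```

The printed string starts with the heading and ends with the name, which is why a further step is needed to isolate "John Doe".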

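3 answers:

Answer 0 (score: 4)

I've provided 2 solutions. The first is not the most elegant, but off the top of my head: you can split the text again and then join everything that comes after "Teacher".

Option 1: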

html = '''
<!doctype html>
<html lang="en">
    <body>
        <div class="sidebar-box">
            <h3><i class="fa fa-users"></i> Management Team</h3>
                        Chairman, Director
        </div>
        <div class="sidebar-box">
            <h3><i class="fa fa-male"></i> Teacher</h3>
                        John Doe
        </div>
        <div class="sidebar-box">
            <h3><i class="fa fa-mortar-board"></i> Awards </h3>
                        National Top Quality Educational Development
        </div>
        <div class="sidebar-box">
            <h3><i class="fa fa-building"></i> School Type</h3>
                        Secondary
        </div>
    </body>
</html>'''



from bs4 import BeautifulSoup
soup4 = BeautifulSoup(html, "html.parser")
# Get School Head Teacher
school_head_teacher = soup4.find_all('div', {'class':'sidebar-box'})
school_head_teacher = school_head_teacher[1].text.strip()

school_head_teacher = school_head_teacher.split()[1:]
school_head_teacher = ' '.join(school_head_teacher)

print(school_head_teacher)

Output:

print(school_head_teacher)
John Doe
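One reason Option 1 is less robust is that it assumes the heading is a single word. The sketch below applies the same split-and-join step to two boxes from the question's markup; the helper name `drop_first_word` is hypothetical and only wraps the answer's two lines.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (only two of the four boxes).
html = """
<div class="sidebar-box">
    <h3><i class="fa fa-users"></i> Management Team</h3>
                Chairman, Director
</div>
<div class="sidebar-box">
    <h3><i class="fa fa-male"></i> Teacher</h3>
                John Doe
</div>
"""

soup = BeautifulSoup(html, "html.parser")
boxes = soup.find_all("div", class_="sidebar-box")


def drop_first_word(div):
    """Option 1's approach: strip, split on whitespace, drop the first word."""
    return " ".join(div.text.strip().split()[1:])


teacher = drop_first_word(boxes[1])     # heading "Teacher" is one word
management = drop_first_word(boxes[0])  # heading "Management Team" is two words
print(teacher)      # John Doe
print(management)   # Team Chairman, Director
```

For the "Teacher" box the result is correct, but for "Management Team" the second heading word leaks into the value, so this approach only fits headings known to be one word long.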

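Option 2:

I think this one is a bit better. You find the text node that contains "Teacher", then get its parent tag (the h3). Since you want the part that follows the heading, take the parent's .next_sibling and strip the whitespace from it.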

soup4(text=re.compile('Teacher'))[0].parent.next_sibling.strip()

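In case there are multiple teachers, I put it in a for loop. If there is only one, you can use the one-line version above instead of the loop.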

from bs4 import BeautifulSoup
import re

soup4 = BeautifulSoup(html, "html.parser")
# Get School Head Teacher
for elem in soup4(text=re.compile('Teacher')):
    print(elem.parent.next_sibling.strip())
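Option 2 can also be wrapped in a small helper that tolerates a missing heading. This is a sketch under assumptions: the function name `value_after_heading` is hypothetical, and it uses the `string=` keyword, which Beautiful Soup 4.4.0 and later accept as the newer name for `text=`.

```python
import re

from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (only two of the four boxes).
html = """
<div class="sidebar-box">
    <h3><i class="fa fa-users"></i> Management Team</h3>
                Chairman, Director
</div>
<div class="sidebar-box">
    <h3><i class="fa fa-male"></i> Teacher</h3>
                John Doe
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def value_after_heading(soup, heading):
    """Return the text that follows the tag whose text contains `heading`, or None."""
    # Find the text node that contains the heading (escaped so it is matched literally).
    label = soup.find(string=re.compile(re.escape(heading)))
    if label is None:
        return None
    # The text node's parent is the <h3>; the value is the node right after it.
    sibling = label.parent.next_sibling
    # NavigableString subclasses str, so this check skips tags and None.
    return sibling.strip() if isinstance(sibling, str) else None


teacher = value_after_heading(soup, "Teacher")
missing = value_after_heading(soup, "Principal")
print(teacher)  # John Doe
print(missing)  # None
```

Returning None instead of raising keeps the caller simple when a page lacks one of the expected headings.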

Answer 1 (score: 1)

Another option:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

teacher_name = soup.find_all('div', class_='sidebar-box')
print(teacher_name[1].contents[2].strip())

Output:

John Doe
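The index 2 in `.contents[2]` depends on how the parser splits the div into children. The sketch below lists those children for a single box from the question's markup, assuming the `html.parser` backend used in the answers; `kinds` is an illustrative name.

```python
from bs4 import BeautifulSoup

# A single box copied from the question's markup.
html = """<div class="sidebar-box">
    <h3><i class="fa fa-male"></i> Teacher</h3>
                John Doe
</div>"""

soup = BeautifulSoup(html, "html.parser")
box = soup.find("div", class_="sidebar-box")

# Text nodes are str subclasses; everything else here is a tag.
kinds = ["text" if isinstance(child, str) else child.name for child in box.contents]
print(kinds)

# Index 0 is the whitespace before the <h3>, index 1 is the <h3> itself,
# and index 2 is the text node that holds the name.
name = box.contents[2].strip()
print(name)
```

Because the index counts whitespace-only text nodes too, it can shift if the markup is minified or gains extra child tags, so this approach fits pages whose layout is stable.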

Answer 2 (score: 1)

<div class="sidebar-box"> <h3><i class="fa fa-male"></i> Teacher</h3> John Doe </div>

Since John Doe is the next sibling of <h3><i class="fa fa-male"></i> Teacher</h3>,

we can call find_next() on <div class="sidebar-box"> and then take next_sibling of the result.

html = '''
<!doctype html>
<html lang="en">
    <body>
        <div class="sidebar-box">
            <h3><i class="fa fa-users"></i> Management Team</h3>
                        Chairman, Director
        </div>
        <div class="sidebar-box">
            <h3><i class="fa fa-male"></i> Teacher</h3>
                        John Doe
        </div>
        <div class="sidebar-box">
            <h3><i class="fa fa-mortar-board"></i> Awards </h3>
                        National Top Quality Educational Development
        </div>
        <div class="sidebar-box">
            <h3><i class="fa fa-building"></i> School Type</h3>
                        Secondary
        </div>
    </body>
</html>'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# Get School Head Teacher
school_head_teacher = soup.find_all('div', {'class':'sidebar-box'})
head_teacher = school_head_teacher[1].find_next().next_sibling
print(head_teacher)

This way, you can also loop over the other divs that follow the same pattern.

for school_info in school_head_teacher:
    print(school_info.find_next().next_sibling)
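The same sibling pattern can collect every box into a dictionary keyed by heading. This is a sketch rather than the answerer's code: the name `details` is an assumption, and it uses `box.find("h3")` in place of `find_next()`, because `find_next()` walks forward in document order and could return a tag outside the div if a box had no child tags.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (only two of the four boxes).
html = """
<div class="sidebar-box">
    <h3><i class="fa fa-users"></i> Management Team</h3>
                Chairman, Director
</div>
<div class="sidebar-box">
    <h3><i class="fa fa-male"></i> Teacher</h3>
                John Doe
</div>
"""

soup = BeautifulSoup(html, "html.parser")

details = {}
for box in soup.find_all("div", class_="sidebar-box"):
    heading = box.find("h3")  # search only inside this div
    if heading is None:
        continue  # skip boxes that do not follow the pattern
    value = heading.next_sibling  # text node right after the <h3>
    # Keep the value only when the sibling is a text node.
    details[heading.get_text(strip=True)] = value.strip() if isinstance(value, str) else ""

print(details)
```

Looking up `details["Teacher"]` then returns the name without relying on the position of the box in the page.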