<p class="">
Teacher:
<a href="/name/nm12345/?ref_=adv_0"
>Scott</a>
<span class="ghost">|</span>
Students:
<a href="/name/nm12345/?ref_=adv_1"
>Benedict</a>,
<a href="/name/nm12345/?ref_=adv_2"
>Chiwetel</a>,
<a href="/name/nm12345/?ref_=adv_3"
>Rachel</a>,
<a href="/name/nm12345/?ref_=adv_4"
>Benedict Wong</a>
</p>
I want to extract the teacher's name, "Scott", which is labelled "Teacher", and also extract all of the student names labelled "Students". I tried:
soup.find(lambda tag:tag)
which returns
<a href="/name/nm12345/?ref_=adv_0"
>Scott</a>
I don't think this is the right approach. How should the code actually extract the "Teacher" and "Students" entries?
Answer 0 (score: 1)
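Assuming your HTML block won't change much when you parse other pages, you could find the p tag by class (your example has none) and verify that the Teacher text is present.
If it is, take .contents[1] from the p tag, which is the first a element inside it.
Next, look for all a tags whose href attribute does not match your teacher's.
Example: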
from bs4 import BeautifulSoup
example = """<p class="">
Teacher:
<a href="/name/nm12345/?ref_=adv_0"
>Scott</a>
<span class="ghost">|</span>
Students:
<a href="/name/nm12345/?ref_=adv_1"
>Benedict</a>,
<a href="/name/nm12345/?ref_=adv_2"
>Chiwetel</a>,
<a href="/name/nm12345/?ref_=adv_3"
>Rachel</a>,
<a href="/name/nm12345/?ref_=adv_4"
>Benedict Wong</a>
</p>"""
soup = BeautifulSoup(example, "html.parser")
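# Find the first tag whose text contains "Teacher" (here, the <p> element).
Classroom = soup.find(lambda x: "Teacher" in x.get_text())
if Classroom is not None:
    # contents[0] is the "Teacher:" text node; contents[1] is the first <a>.
    Teacher = Classroom.contents[1]
    TeacherUrl = Teacher["href"]
    # Every other link, i.e. any <a> whose href differs from the teacher's.
    Students = Classroom.find_all(lambda tag: tag.has_attr('href') and TeacherUrl not in tag["href"])
    print(Teacher.text)
    for Student in Students:
        print(Student.text)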
Which outputs:
Scott
Benedict
Chiwetel
Rachel
Benedict Wong
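The approach above depends on the teacher link sitting at .contents[1] and on the student links having a different href. As an alternative, here is a minimal sketch that walks the children of the p tag and assigns each link to the most recent "Teacher:" or "Students:" text label. The function name group_names_by_label and its labels parameter are hypothetical and not part of the original answer; the sketch also assumes the labels appear as plain text directly inside the p tag, as in the sample.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

sample = """<p class="">
Teacher:
<a href="/name/nm12345/?ref_=adv_0"
>Scott</a>
<span class="ghost">|</span>
Students:
<a href="/name/nm12345/?ref_=adv_1"
>Benedict</a>,
<a href="/name/nm12345/?ref_=adv_2"
>Chiwetel</a>,
<a href="/name/nm12345/?ref_=adv_3"
>Rachel</a>,
<a href="/name/nm12345/?ref_=adv_4"
>Benedict Wong</a>
</p>"""


def group_names_by_label(html, labels=("Teacher", "Students")):
    """Hypothetical helper: map each label to the link texts that follow it."""
    soup = BeautifulSoup(html, "html.parser")
    # Locate the <p> that contains the first label (assumed to be present as text).
    paragraph = soup.find(lambda tag: tag.name == "p" and labels[0] in tag.get_text())
    groups = {label: [] for label in labels}
    if paragraph is None:
        return groups

    current = None  # label that the next <a> tags belong to
    for child in paragraph.children:
        if isinstance(child, NavigableString):
            # A text node such as "\nStudents:\n" switches the active label.
            for label in labels:
                if label + ":" in child:
                    current = label
        elif isinstance(child, Tag) and child.name == "a" and current is not None:
            groups[current].append(child.get_text(strip=True))
    return groups


result = group_names_by_label(sample)
print(result["Teacher"])
print(result["Students"])
```

Because the grouping is driven by the text labels rather than by comparing href values, it should still work if the teacher and a student happen to share the same URL, though it will break if the labels are wrapped in their own tags.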