Extracting several values with Beautiful Soup

Time: 2017-03-07 11:27:18

Tags: tags beautifulsoup

 <p class="">
    Teacher:
<a href="/name/nm12345/?ref_=adv_0"
>Scott</a>
             <span class="ghost">|</span> 
    Students:
<a href="/name/nm12345/?ref_=adv_1"
>Benedict</a>, 
<a href="/name/nm12345/?ref_=adv_2"
>Chiwetel</a>, 
<a href="/name/nm12345/?ref_=adv_3"
>Rachel</a>, 
<a href="/name/nm12345/?ref_=adv_4"
>Benedict Wong</a>
    </p>

I want to extract the teacher's name, "Scott", which is labeled "Teacher", and also extract all the student names labeled "Students". I tried soup.find(lambda tag: tag), and it returns

<a href="/name/nm12345/?ref_=adv_0"
>Scott</a>

I don't think this is the right approach. How should the code actually extract the names under the "Teacher" and "Students" labels?

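1 Answer:

Answer 0: (score: 1)

Assuming your HTML block doesn't change much across the other pages you parse, you can find the p tag by its class (your example doesn't have one) and verify that the Teacher text is present.

If it is, take .contents[1] from the p tag; that is the first a in the element, because .contents[0] is the "Teacher:" text node.

Next, find all a tags whose href attribute does not match the teacher's.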
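The .contents[1] step can look arbitrary, so here is a small sketch of what the first children of the p tag look like. It uses a shortened copy of the HTML from the question; the names html, p, first and second are only illustrative.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened copy of the question's HTML (assumption: same layout as the real page).
html = """<p class="">
Teacher:
<a href="/name/nm12345/?ref_=adv_0"
>Scott</a>
<span class="ghost">|</span>
Students:
<a href="/name/nm12345/?ref_=adv_1"
>Benedict</a>
</p>"""

p = BeautifulSoup(html, "html.parser").find("p")

# Child 0 is the text node holding the "Teacher:" label,
# so the first <a> tag is child 1.
first, second = p.contents[0], p.contents[1]
print(type(first).__name__, repr(first.strip()))
print(second.name, second.get_text())
```

If the page ever puts another element before the label, the index shifts, so this positional lookup is only as stable as the markup.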

Example:

from bs4 import BeautifulSoup

example = """<p class="">
Teacher:
<a href="/name/nm12345/?ref_=adv_0"
>Scott</a>
         <span class="ghost">|</span> 
Students:
<a href="/name/nm12345/?ref_=adv_1"
>Benedict</a>, 
<a href="/name/nm12345/?ref_=adv_2"
>Chiwetel</a>, 
<a href="/name/nm12345/?ref_=adv_3"
>Rachel</a>, 
<a href="/name/nm12345/?ref_=adv_4"
>Benedict Wong</a>
</p>"""

soup = BeautifulSoup(example, "html.parser")

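# Find the first tag whose text contains "Teacher". In this snippet that is
# the <p>; on a full page an enclosing tag would match first, so narrow the
# search to the <p> there.
classroom = soup.find(lambda tag: "Teacher" in tag.get_text())

if classroom is not None:
    # contents[0] is the "Teacher:" text node, so contents[1] is the first <a>.
    teacher = classroom.contents[1]
    teacher_url = teacher["href"]

    # Every other link in the block is treated as a student.
    students = classroom.find_all(
        lambda tag: tag.has_attr("href") and teacher_url not in tag["href"]
    )

    print(teacher.text)
    for student in students:
        print(student.text)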

Which outputs:

Scott
Benedict
Chiwetel
Rachel
Benedict Wong
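
As a possible alternative to comparing href values, the sketch below walks the children of the p tag and files each a under the most recent label it has seen. The helper name names_by_label and its labels parameter are made up for illustration, and it assumes the labels appear as plain text directly inside the p tag, as in the question's HTML.

```python
from bs4 import BeautifulSoup, NavigableString

html = """<p class="">
Teacher:
<a href="/name/nm12345/?ref_=adv_0"
>Scott</a>
         <span class="ghost">|</span>
Students:
<a href="/name/nm12345/?ref_=adv_1"
>Benedict</a>,
<a href="/name/nm12345/?ref_=adv_2"
>Chiwetel</a>,
<a href="/name/nm12345/?ref_=adv_3"
>Rachel</a>,
<a href="/name/nm12345/?ref_=adv_4"
>Benedict Wong</a>
</p>"""


def names_by_label(p_tag, labels=("Teacher", "Students")):
    """Group the text of <a> children under the most recent label text node.

    Hypothetical helper: assumes each label is a bare text node such as
    "Teacher:" sitting directly inside p_tag.
    """
    groups = {label: [] for label in labels}
    current = None
    for child in p_tag.children:
        if isinstance(child, NavigableString):
            # Text nodes carry the labels; separators like "," are ignored.
            text = child.strip().rstrip(":")
            if text in groups:
                current = text
        elif child.name == "a" and current is not None:
            groups[current].append(child.get_text(strip=True))
    return groups


soup = BeautifulSoup(html, "html.parser")
result = names_by_label(soup.find("p"))
print(result)
# {'Teacher': ['Scott'], 'Students': ['Benedict', 'Chiwetel', 'Rachel', 'Benedict Wong']}
```

This keeps the teacher and the students apart by position relative to their labels, so it would still work if two links happened to share the same href.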