How can I extract the text between <hr> tags using BeautifulSoup4?

Asked: 2020-04-10 19:49:45

Tags: python html css web-scraping beautifulsoup

I am trying to find a way to extract, as separate pieces, all of the text that sits between each pair of hr tags in this document:

<html>
<head>
<!--Created 6-11-96 by Dan Axtell-->
<meta content="101 Elementary School Mission Statements compiled from the Web 11 June 1996" name="DESCRIPTION"/>
</head>
<body><a name="TOP">
<h2>101 Elementary School Mission Statements</h2>
11 June 1996
<p>
This list was compiled when the web was young. Most links don't work now. Note that some of these mission statements may be copyrighted. All material was pasted verbatim from the web pages, which accounts for the odd formating.
</p><hr/>
                                 WINDSOR Elementary School,
                            in partnership with its children, families,
                             community and Richland District Two,
                           guarantees each child a superior education
              by providing quality instruction and challenging learning experiences
                                in a safe and orderly environment
                  which will foster life-long learning and responsible citizenship.
</a>
<a href="http://www.scsn.net/users/rich2/elem/windsor/text.htm">http://www.scsn.net/users/rich2/elem/windsor/text.htm</a>
<hr/>

This We Believe...
Yokayo Elementary School provides a nurturing environment committed to achiving excellence. All students are challenged to
reach their maximum potential by learning at their functional level to provide a solid foundation of skills, knowledge and values.
This foundation enables each student to become a well-educated, productive adult able to cope with an ever changing world.

We believe that all learners must become:

     Effective Communicators who will use verbal, written, artistic and technological forms of communication to give,
     send, and receive information.
     Inspired Learners who are accountable for demonstrating, assessing, and directing their present and life-long
     intellectual growth.
     Productive Workers who perform collaboratively and independently to create quality products and services that
     reflect personal pride and responsiblility.
     Responsible Citizens who have a global and multi-cultural perspective, and who take the initiative for improving the
     quality of life for self and others.
     Resourceful Thinkers who independently and creatively strive to solve complex problems through reflection, risk
     taking, and critical evaluation.
<a href="http://happy.yokayo.uusd.k12.ca.us/Goals.html">http://happy.yokayo.uusd.k12.ca.us/Goals.html</a>
<hr/>
University Elementary School

                                    Mission Statement

  At University Elementary School, students should be accepted, appreciated, nurtured, and
challanged according to their individual needs.

  Through their education at school, students should gain the skills, strategies, and desire
necessary for continued learning.  They should also develop a strong sense of responsibility for
themselves and toward each other, their community, and the earth's resources.

 To this end, faculty and staff should create a rich multicultural environment for learning; design
an integrated curriculum with strong science, fine arts, and social studies components; provide for
children to become self-directed learners; and share their enthusiasm for learning, in an
atmosphere of mutual respect and appreciation.
<a href="http://www.intersource.com/~wmorales/ue/mission.html">http://www.intersource.com/~wmorales/ue/mission.html</a>
<hr/>

The document contains 100 excerpts; this is only a sample, but the format stays the same throughout. I tried using .nextSibling like this:

for i in soup.find_all('hr'):
    print(i.nextSibling)

The output is:

                                 WINDSOR Elementary School,



University Elementary School

Altamont Elementary School

...

How can I extend this to include everything up to the next hr tag, so that I can extract each statement in full, for example:

WINDSOR Elementary School,
                            in partnership with its children, families,
                             community and Richland District Two,
                           guarantees each child a superior education
              by providing quality instruction and challenging learning experiences
                                in a safe and orderly environment
                  which will foster life-long learning and responsible citizenship.
</a>
<a href="http://www.scsn.net/users/rich2/elem/windsor/text.htm">http://www.scsn.net/users/rich2/elem/windsor/text.htm</a>
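One possible approach, offered as a sketch rather than a definitive answer: `.nextSibling` returns only the single node right after each hr, so the rest of the statement is missed. In the sample the first hr also sits inside the unclosed `<a name="TOP">` element, so the next hr is not one of its siblings at all. Walking `.next_elements` (document order) and collecting text nodes until the next hr tag avoids both problems. The function name `text_between_hrs` and the choice of the `html.parser` backend are assumptions made for illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag


def text_between_hrs(html):
    """Return one string per <hr>, holding the text that follows it
    up to the next <hr> (hypothetical helper, not part of bs4)."""
    soup = BeautifulSoup(html, "html.parser")
    sections = []
    for hr in soup.find_all("hr"):
        parts = []
        # next_elements walks the tree in document order, so it keeps
        # going even when the next <hr> is not a sibling of this one.
        for node in hr.next_elements:
            if isinstance(node, Tag) and node.name == "hr":
                break  # reached the start of the next section
            if isinstance(node, NavigableString) and not isinstance(node, Comment):
                parts.append(str(node))
        text = "".join(parts).strip()
        if text:  # skip an <hr> that has no text after it
            sections.append(text)
    return sections
```

Calling `text_between_hrs(html_string)` on the page source should give a list with one entry per statement, each including the link text that follows it. Anything after the last hr (a page footer, for instance) would be returned as a final entry, so that entry may need to be dropped.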

0 answers:

No answers yet.