How to get paragraphs from HTML using Python

Time: 2016-04-22 22:02:42

Tags: python html regex beautifulsoup

How can I get paragraphs from badly structured HTML?

I have this raw HTML text:

This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
<br>
<ul>
    <li>AA Early Childhood Education, or related field.  </li>
    <li>2+ years experience in a licensed childcare facility  </li>
    <li>Ability to meet state requirements, including finger print clearance.  </li>
    <li>Excellent oral and written communication skills  </li>
    <li>Strong organization and time management skills.  </li>
    <li>Creativity in expanding children's learning through play.<br>  </li>
    <li>Strong classroom management skills.<br>  </li>
</ul>
<p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children. 
    <br> 
</p>

I'm using Python and tried something like this:

soup = BeautifulSoup(html)

It returns new HTML text containing 2 paragraphs:

<html>

<body>
    <p>This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
        <br/>
    </p>
    <ul>
        <li>AA Early Childhood Education, or related field. </li>
        <li>2+ years experience in a licensed childcare facility </li>
        <li>Ability to meet state requirements, including finger print clearance. </li>
        <li>Excellent oral and written communication skills </li>
        <li>Strong organization and time management skills. </li>
        <li>Creativity in expanding children's learning through play.
            <br/> </li>
        <li>Strong classroom management skills.
            <br/> </li>
    </ul>
    <p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.
        <br/> </p>
</body>

</html>

But this is not what I expected. As the result, I want to get this HTML text:

<html>

<body>
    <p>This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
        AA Early Childhood Education, or related field.
        2+ years experience in a licensed childcare facility
        Ability to meet state requirements, including finger print clearance.
        Excellent oral and written communication skills
        Strong organization and time management skills.
        Creativity in expanding children's learning through play.
        Strong classroom management skills.
    </p>
    <p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.</p>
</body>

</html>

To get the HTML above, I think the best way is to remove every HTML tag except <p></p> from the original HTML.

To do that, I tried the following regular expression:

new_html = re.sub('<[^<]+?>', '', html)

Obviously, this regex removes all HTML tags. So how do I remove all HTML tags except <p></p>?

If someone helps me write the regex, I will feed new_html to BeautifulSoup() and get the HTML I expect.

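2 answers:

Answer 0 (score: 1)

This is somewhat manual document manipulation, but you can loop over the li elements, append each one's text to the first paragraph, and then remove it. Afterwards, remove the ul element as well: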

from bs4 import BeautifulSoup


data = """
This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
<br>
<ul>
    <li>AA Early Childhood Education, or related field.  </li>
    <li>2+ years experience in a licensed childcare facility  </li>
    <li>Ability to meet state requirements, including finger print clearance.  </li>
    <li>Excellent oral and written communication skills  </li>
    <li>Strong organization and time management skills.  </li>
    <li>Creativity in expanding children's learning through play.<br>  </li>
    <li>Strong classroom management skills.<br>  </li>
</ul>
<p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.
    <br>
</p>"""

soup = BeautifulSoup(data, "lxml")

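p = soup.p  # first paragraph (lxml wraps the leading loose text in a <p>)
for li in soup.find_all("li"):
    # copy each list item's text into the first paragraph, then drop the <li>
    p.append(li.get_text())
    li.extract()

# remove the now item-less <ul> as well
soup.find("ul").extract()
print(soup.prettify())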

This prints the 2 paragraphs you were aiming for:

<html>
 <body>
  <p>
   This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
   <br/>
   AA Early Childhood Education, or related field.
   2+ years experience in a licensed childcare facility
   Ability to meet state requirements, including finger print clearance.
   Excellent oral and written communication skills
   Strong organization and time management skills.
   Creativity in expanding children's learning through play.
   Strong classroom management skills.
  </p>
  <p>
   The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.
   <br/>
  </p>
 </body>
</html>

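Note that lxml, html.parser, and html5lib differ significantly in how they parse the input HTML you posted. html5lib and html.parser do not automatically create the first paragraph, which makes the code above specific to lxml.

A better approach might be to build a separate "soup" object. Sample: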

from bs4 import BeautifulSoup


data = """
This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
<br>
<ul>
    <li>AA Early Childhood Education, or related field.  </li>
    <li>2+ years experience in a licensed childcare facility  </li>
    <li>Ability to meet state requirements, including finger print clearance.  </li>
    <li>Excellent oral and written communication skills  </li>
    <li>Strong organization and time management skills.  </li>
    <li>Creativity in expanding children's learning through play.<br>  </li>
    <li>Strong classroom management skills.<br>  </li>
</ul>
<p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.
    <br>
</p>"""

soup = BeautifulSoup(data, "lxml")

# create new soup
new_soup = BeautifulSoup("<body></body>", "lxml")
new_body = new_soup.body

# create first paragraph
first_p = new_soup.new_tag("p")
first_p.append(soup.p.get_text())

for li in soup.find_all("li"):
    first_p.append(li.get_text())

new_body.append(first_p)

# create second paragraph
second_p = soup.find_all("p")[-1]
new_body.append(second_p)

print(new_soup.prettify())

Prints:

<html>
 <body>
  <p>
   This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
   AA Early Childhood Education, or related field.
   2+ years experience in a licensed childcare facility
   Ability to meet state requirements, including finger print clearance.
   Excellent oral and written communication skills
   Strong organization and time management skills.
   Creativity in expanding children's learning through play.
   Strong classroom management skills.
  </p>
  <p>
   The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.
   <br/>
  </p>
 </body>
</html>
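Since both snippets above rely on lxml wrapping the leading text in a <p>, a parser-independent variant may be useful. The following is only a minimal sketch, not part of the original answer: it swaps BeautifulSoup for the standard-library html.parser module, and the names ParagraphCollector, to_paragraphs, and sample are hypothetical. It assumes that every <p> start tag begins a new paragraph and that all other tags can simply be dropped.

```python
from html import escape
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect text into paragraphs; each <p> start tag opens a new one."""

    def __init__(self):
        super().__init__()
        # Text that appears before the first <p> goes into paragraph 0.
        self.paragraphs = [[]]

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.paragraphs.append([])

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self.paragraphs[-1].append(text)


def to_paragraphs(markup):
    """Return the non-empty paragraphs found in markup as plain strings."""
    collector = ParagraphCollector()
    collector.feed(markup)
    collector.close()
    return ["\n".join(chunks) for chunks in collector.paragraphs if chunks]


# Shortened, hypothetical stand-in for the HTML in the question.
sample = """Intro text:
<br>
<ul>
    <li>First item.  </li>
    <li>Second item.<br>  </li>
</ul>
<p>Closing paragraph.
    <br>
</p>"""

paragraphs = to_paragraphs(sample)
new_html = "".join("<p>{}</p>".format(escape(p, quote=False)) for p in paragraphs)
print(new_html)
```

Because the list items are never treated as paragraph boundaries, their text ends up in the first paragraph, and the trailing <br> tags disappear along with every other non-<p> tag.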

Answer 1 (score: 1)

Short answer

new_html = re.sub('<([^p]|[^>/][^>]+|/[^p]|/[^>][^>]+)>', '', html)

Long answer

Your original regex looks odd. I would use [^>] instead of [^<]: what you want is "anything that is not the closing bracket".

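Also, note the + followed by ?.

+ means: "repeat one or more times".

? means: "repeat zero or one time".

Placed directly after +, however, the ? is not a second quantifier: +? is the lazy (non-greedy) form of +, which matches as few characters as possible. With a negated character class such as [^>], the lazy form is not needed.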
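For reference, here is a small sketch of how these quantifiers behave in Python's re module. The sample string and variable names are made up for illustration. It compares a greedy +, a lazy +?, and the negated character class [^>] suggested above.

```python
import re

text = "a <b>bold</b> c"

# Greedy: .+ runs to the last '>' on the line, so both tags merge into one match.
greedy = re.findall(r"<.+>", text)

# Lazy: .+? stops at the first '>' it can, so each tag is matched separately.
lazy = re.findall(r"<.+?>", text)

# Negated class: [^>]+ cannot cross a '>', so no lazy quantifier is needed.
negated = re.findall(r"<[^>]+>", text)

print(greedy)   # ['<b>bold</b>']
print(lazy)     # ['<b>', '</b>']
print(negated)  # ['<b>', '</b>']
```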

Anyway, we can express the regex you need like this:

"opening bracket", then "anything that is not 'p' and not '/p'", then "closing bracket"

Which is equivalent to:

"opening bracket", then "one char that is not 'p'" or "one char that is not a slash, followed by one or more chars" or "a slash followed by one char that is not 'p'" or "a slash followed by two or more chars", then "closing bracket".

Which is equivalent to:

< then ([^p]|[^>/][^>]+|/[^p]|/[^>][^>]+) then >

That is what the regex above expresses.
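The four alternatives can also be spelled out one per line with re.VERBOSE, which ignores whitespace in the pattern and allows comments. This is just a restatement of the same regex for readability; the name TAG_EXCEPT_P is a hypothetical label, not something from the answer.

```python
import re

TAG_EXCEPT_P = re.compile(
    r"""
    <                 # opening angle bracket
    (
        [^p]          # exactly one char that is not 'p'
      | [^>/][^>]+    # one non-slash char followed by one or more chars
      | /[^p]         # a slash followed by one char that is not 'p'
      | /[^>][^>]+    # a slash followed by two or more chars
    )
    >                 # closing angle bracket
    """,
    re.VERBOSE,
)

print(TAG_EXCEPT_P.sub("", "<p>x<b>y</b></p>"))  # <p>xy</p>
```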

Here is a quick test typed into the Python console:

import re

re.sub(
    '<([^p]|[^>/][^>]+|/[^p]|/[^>][^>]+)>',
    '',
    'aa <p> bb <a> cc <li> dd <pp> ee <pa> ff </p> gg </a> hh </li> ii </pp> jj </pa> ff')
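One limitation worth noting: the alternation above also strips a <p> tag that carries attributes, such as <p class="x">, because it then falls under "two or more chars". A possible alternative, offered only as a sketch and not as part of the original answer, uses a negative lookahead instead. The names NON_P_TAG and strip_tags_except_p are hypothetical, and the pattern assumes tag names should be matched case-insensitively.

```python
import re

# Match any tag unless the '<' is immediately followed by 'p' or '/p'
# as a whole word (so <pp>, <pa>, and <pre> are still removed).
NON_P_TAG = re.compile(r"<(?!/?p\b)[^>]+>", re.IGNORECASE)


def strip_tags_except_p(markup):
    """Remove every tag except <p ...> and </p>."""
    return NON_P_TAG.sub("", markup)


sample = ("aa <p> bb <a> cc <li> dd <pp> ee <pa> ff </p> "
          "gg </a> hh </li> ii </pp> jj </pa> ff")
print(strip_tags_except_p(sample))
print(strip_tags_except_p('<p class="x">a<br/>b</p>'))  # <p class="x">ab</p>
```

On the test string from the answer, this produces the same result as the alternation; the two only differ for <p> tags with attributes or in upper case.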