Splitting text on certain HTML structures using BeautifulSoup

Time: 2015-02-01 17:39:55

Tags: html beautifulsoup

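I am trying to split some HTML according to a certain pattern.

A specific part of the HTML has to be split into one or more sections, or an array of text. The way I can divide this HTML is by looking for the first <strong> and a <br />. All of the text between these two tags has to be put into a list and iterated over.

How can I solve this easily?

So I want the following HTML: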

<div class="clearfix">
    <!--# of ppl associated with place-->
        This is some kind of buzzword:<br />
    <br />
    <!--Persontype-->
        <strong>Jimbo</strong> Jack            <br />
Some filler text            <br />
More weird stuff
            <br />
Unstructured text        <br />
        <br />
        <strong>Jacky</strong> Bradson            <br />
This is just a test            <br />
Nothing but a test
            <br />
More unstructured stuff        <br />
        <br />
        <strong>Junior</strong> Bossman            <br />
This is fluffy
            <br />
As I would expect        <br />
        <br />
</div>

to be split into the following parts.

First part:

        <strong>Jimbo</strong> Jack            <br />
Some filler text            <br />
More weird stuff
            <br />
Unstructured text        <br />
        <br />

Second part:

        <strong>Jacky</strong> Bradson            <br />
This is just a test            <br />
Nothing but a test
            <br />
More unstructured stuff        <br />
        <br />

Third part:

        <strong>Junior</strong> Bossman            <br />
This is fluffy
            <br />
As I would expect        <br />
        <br />
</div>

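1 Answer:

Answer 0: (score: 3)

The basic solution uses join, prettify and split. The idea is to convert the parsed document back to text and split out the parts of interest.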

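from bs4 import BeautifulSoup

# `text` is the HTML string shown below; the parser is named explicitly
soup = BeautifulSoup(text, 'html.parser')
# Keep what follows the <!--Persontype--> comment, then split at each <strong>
chunks = soup.prettify().split('<!--Persontype-->')[1].split('<strong>')
for chunk in chunks[1:]:  # chunks[0] is only the whitespace before the first <strong>
    print('<strong>' + chunk)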

The text is:

text= '''
<div class="clearfix">
    <!--# of ppl associated with place-->
        This is some kind of buzzword:<br />
    <br />
    <!--Persontype-->
        <strong>Jimbo</strong> Jack            <br />
Some filler text            <br />
More weird stuff
            <br />
Unstructured text        <br />
        <br />
        <strong>Jacky</strong> Bradson            <br />
This is just a test            <br />
Nothing but a test
            <br />
More unstructured stuff        <br />
        <br />
        <strong>Junior</strong> Bossman            <br />
This is fluffy
            <br />
As I would expect        <br />
        <br />
</div>
'''

The output is:

Jimbo Jack
Some filler text
More weird stuff
Unstructured text

Jacky Bradson
This is just a test
Nothing but a test
More unstructured stuff

Junior Bossman
This is fluffy
As I would expect
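
If you would rather not depend on how prettify lays out the markup, another option is to walk the children of the div and start a new group at every <strong>. The following is only a minimal sketch of that idea, not part of the original answer: the helper names split_on_strong and section_lines are made up for illustration, the sample HTML is a shortened version of the question's, and it assumes every <strong> and <br /> is a direct child of the div.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, Tag

# Shortened sample in the same shape as the HTML from the question.
html = """
<div class="clearfix">
    <!--Persontype-->
        <strong>Jimbo</strong> Jack            <br />
Some filler text            <br />
        <br />
        <strong>Jacky</strong> Bradson            <br />
Nothing but a test
            <br />
        <br />
        <strong>Junior</strong> Bossman            <br />
This is fluffy
            <br />
        <br />
</div>
"""


def split_on_strong(container):
    """Group the direct children of `container`; each <strong> opens a new group."""
    groups = []
    for node in container.children:
        if isinstance(node, Tag) and node.name == "strong":
            groups.append([])
        if groups:  # skip everything that comes before the first <strong>
            groups[-1].append(node)
    return groups


def section_lines(nodes):
    """Return the non-empty text lines of one group, using <br /> as the line break."""
    lines, current = [], []
    for node in nodes:
        if isinstance(node, Tag) and node.name == "br":
            # Collapse runs of whitespace and close the current line.
            lines.append(" ".join("".join(current).split()))
            current = []
        elif isinstance(node, Comment):
            continue  # HTML comments carry no visible text
        elif isinstance(node, Tag):
            current.append(node.get_text())
        else:
            current.append(str(node))
    if current:
        lines.append(" ".join("".join(current).split()))
    return [line for line in lines if line]


soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="clearfix")
result = [section_lines(group) for group in split_on_strong(div)]

for lines in result:
    print(lines)
```

Each entry of result is the list of text lines for one person, so it can be iterated over directly. If you need the original markup instead of the text, joining str(node) over each group from split_on_strong should give it back section by section.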