Extract HTML between tags

Date: 2016-07-10 09:46:11

Tags: python beautifulsoup

I want to extract all of the HTML between specific HTML tags.

<html>
<div class="class1">Included Text</div>
[...]
<h1><b>text</b></h1><span>[..]</span><div>[...]</div>
[...]
<span class="class2">
[...]</span>

So I want to grab all of the HTML (tags and values) between the class1 div and the class2 span:

Included Text
<h1><b>text</b></h1><span>[..]</span><div>[...]</div>

There are also multiple occurrences in the HTML file, so I want to match all of them. This is what I mean:

<html>
(first occurrence)
<div class="class1">Included Text</div>
[...]
<h1><b>text</b></h1><span>[..]</span><div>[...]</div>
[...]
<span class="class2">
[...]</span>

(2nd occurrence)
<div class="class1">Included Text</div>
[...]
<h1><b>text</b></h1><span>[..]</span><div>[...]</div>
[...]
<span class="class2">
[...]</span>  

(third occurrence)
<div class="class1">Included Text</div>
[...]
<h1><b>text</b></h1><span>[..]</span><div>[...]</div>
[...]
<span class="class2">
[...]</span>  
</html>

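I have been searching for an answer using Beautifulsoup 4. However, all of the questions and answers I found are about extracting the text values between tags, which is not what I want. I would also like to know whether this is even possible with Beautifulsoup, or whether I have to use regular expressions.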

1 Answer:

Answer 0: (score: 2)

You can use bs4 and itertools.takewhile to roll your own function:

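from itertools import takewhile

from bs4 import BeautifulSoup

h = """<html>
 <div class="class1">Included Text</div>
[...]
<h1><b>text</b></h1><span>[..]</span><div>[...]</div>
[...]
<span class="class2">
[...]</span>"""

soup = BeautifulSoup(h, "lxml")


def get_html_between(start_select, end_tag, cls):
    start = soup.select_one(start_select)
    all_next = start.find_all_next()
    # Inner HTML of the start tag itself.
    yield start.decode_contents()
    # Yield each following tag until the end marker, e.g. <span class="class2">.
    # tag.name is the element name; tag.get("class") is a list of class names.
    for t in takewhile(lambda tag: not (tag.name == end_tag and cls in tag.get("class", [])), all_next):
        yield t


for ele in get_html_between("div.class1", "span", "class2"):
    print(ele)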

Output:

Included Text
<h1><b>text</b></h1>
<b>text</b>
<span>[..]</span>
<div>[...]</div>
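
The function relies on itertools.takewhile stopping at the first element that fails the predicate and never resuming, even if later elements would pass again. A minimal stand-alone illustration, using a made-up list of tag labels (hypothetical, not real bs4 objects) in place of find_all_next():

```python
from itertools import takewhile

# Hypothetical labels for the tags that follow a start tag, in document order.
names = ["h1", "b", "span", "div", "span.class2", "div", "h1"]

# Consume items until the end marker appears; nothing after it is yielded,
# even though the later "div" and "h1" would satisfy the predicate.
between = list(takewhile(lambda name: name != "span.class2", names))
print(between)
```

This is why each occurrence ends at its own closing span instead of running on to the end of the document.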

To make it more flexible, you can pass in the initial tag and a cond lambda/function. For multiple class1 elements, just iterate and pass each one:

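from itertools import takewhile

from bs4 import BeautifulSoup


def get_html_between(start_tag, cond):
    # Inner HTML of the start tag, then every following tag while cond holds.
    yield start_tag.decode_contents()
    all_next = start_tag.find_all_next()
    for ele in takewhile(cond, all_next):
        yield ele


# Keep going until the closing marker <span class="class2"> is reached.
cond = lambda tag: not (tag.name == "span" and "class2" in tag.get("class", []))
soup = BeautifulSoup(h, "lxml")
for tag in soup.select("div.class1"):
    for ele in get_html_between(tag, cond):
        print(ele)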
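
If the end marker varies, the condition can be built by a small factory instead of a hard-coded lambda. The following is only a sketch: the helper name until, the sample markup, and the choice of the built-in "html.parser" are assumptions made for illustration, not part of the original answer.

```python
from itertools import takewhile

from bs4 import BeautifulSoup


def until(name, cls):
    """Return a predicate that is True until a <name class="cls"> tag appears."""
    def cond(tag):
        return not (tag.name == name and cls in tag.get("class", []))
    return cond


# Small made-up document with two occurrences of the start/end pair.
html = """<div class="class1">A</div><h1><b>x</b></h1>
<span class="class2">end</span>
<div class="class1">B</div><p>y</p>
<span class="class2">end</span>"""

soup = BeautifulSoup(html, "html.parser")
groups = []
for start in soup.select("div.class1"):
    # All tags after the start tag, cut off at the first matching end marker.
    tags = takewhile(until("span", "class2"), start.find_all_next())
    groups.append([start.decode_contents()] + [str(t) for t in tags])
print(groups)
```

Each entry in groups holds the pieces for one occurrence, so they can be processed separately instead of only printed.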

With your latest edit:

In [15]: cond = lambda tag: not (tag.name == "span" and "class2" in tag.get("class", []))

In [16]: for tag in soup.select("div.class1"):
   ....:     for ele in get_html_between(tag, cond):
   ....:         print(ele)
   ....:     print("\n")
   ....:
Included Text
<h1><b>text</b></h1>
<b>text</b>
<span>[..]</span>
<div>[...]</div>


Included Text
<h1><b>text</b></h1>
<b>text</b>
<span>[..]</span>
<div>[...]</div>


Included Text
<h1><b>text</b></h1>
<b>text</b>
<span>[..]</span>
<div>[...]</div>
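
Because find_all_next() visits every descendant, nested tags such as <b>text</b> appear twice in the output above: once inside their parent and once on their own. A possible variation, assuming the start div and the end span are siblings (as they are in the sample markup), walks next_siblings instead and joins the pieces into one HTML string per occurrence. The helper names is_end and html_between are hypothetical, and the built-in "html.parser" is used here only to keep the sketch free of extra dependencies.

```python
from itertools import takewhile

from bs4 import BeautifulSoup, Tag


def is_end(node):
    # True only for the closing marker <span class="class2">; text nodes never match.
    return isinstance(node, Tag) and node.name == "span" and "class2" in node.get("class", [])


def html_between(start):
    # Inner HTML of the start tag, followed by each sibling up to the end marker.
    parts = [start.decode_contents()]
    parts += [str(node) for node in takewhile(lambda n: not is_end(n), start.next_siblings)]
    return "".join(parts)


html = ('<div class="class1">Included Text</div>'
        '<h1><b>text</b></h1><span>[..]</span><div>[...]</div>'
        '<span class="class2">[...]</span>')
soup = BeautifulSoup(html, "html.parser")
result = [html_between(div) for div in soup.select("div.class1")]
print(result)
```

Text nodes between the tags are kept as well, which the tag-only find_all_next() version skips. If the markers are not siblings, this sketch does not apply as written.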