Cutting/slicing an HTML document with BeautifulSoup?

Posted: 2016-03-23 21:52:09

Tags: python html beautifulsoup html-parsing

I have an HTML document like this:

<h1> Name of Article </h2> 
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
<h2> References </h2> 
<p>Html I do not want...</p>

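I don't need the article's references, so I want to slice the document at the second h2 tag.

Obviously I can get a list of the h2 tags like this: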

soup = BeautifulSoup(html)
soupset = soup.find_all('h2')
soupset[1] #this would get the h2 heading 'References' but not what comes before it 

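I don't want a list of h2 tags. I want to slice the document at the second h2 tag and save everything above it in a new variable. Basically, the output I want is: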

<h1> Name of Article </h2> 
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>

What is the best way to "cut"/slice an HTML document like this, rather than simply finding a tag and outputting the tag itself?

2 answers:

Answer 0: (score: 1)

You can remove/extract every following sibling of the "References" element, and then the element itself:

import re
from bs4 import BeautifulSoup

data = """
<div>
    <h1> Name of Article </h2>
    <p>First Paragraph I want</p>
    <p>More Html I'm interested in</p>
    <h2> Subheading in the article I also want </h2>
    <p>Even more Html i want to pull out of the document.</p>
    <h2> References </h2>
    <p>Html I do not want...</p>
</div>
"""
soup = BeautifulSoup(data, "lxml")

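# Locate the "References" heading (older bs4 releases spell this argument text=).
references = soup.find("h2", string=re.compile("References"))
# Remove every tag that follows the heading at the same level, then the heading itself.
for elm in references.find_next_siblings():
    elm.extract()
references.extract()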

print(soup)
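
For reuse, the same idea can be wrapped in a small function. This is only a sketch: cut_before_heading is a hypothetical helper name, not part of BeautifulSoup, and it assumes the unwanted content sits at the same nesting level as the heading. It uses the built-in "html.parser" so lxml is not required, and it iterates over next_siblings so that stray text nodes after the heading are removed along with the tags.

```python
from bs4 import BeautifulSoup


def cut_before_heading(html, heading_text, parser="html.parser"):
    """Return the markup that precedes the first <h2> containing heading_text.

    Hypothetical helper for illustration; assumes the content to drop
    consists of later siblings of that heading.
    """
    soup = BeautifulSoup(html, parser)
    # Pick the first <h2> whose text contains the marker string.
    heading = next(
        (h2 for h2 in soup.find_all("h2") if heading_text in h2.get_text()),
        None,
    )
    if heading is None:
        return str(soup)  # no such heading: nothing to cut
    # Copy the siblings into a list first, because extracting nodes
    # while walking the live generator would stop the iteration early.
    for sibling in list(heading.next_siblings):
        sibling.extract()
    heading.extract()
    return str(soup)


sample = (
    "<div><h1>Name of Article</h1><p>First</p>"
    "<h2>Subheading</h2><p>More</p>"
    "<h2>References</h2><p>Unwanted</p></div>"
)
print(cut_before_heading(sample, "References"))
```

Returning a string keeps the trimmed markup in its own variable, which is what the question asks for; return the soup object instead if you want to keep navigating the tree.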

Answer 1: (score: 0)

You can find the position of the h2 in the string, then take the substring that ends there:

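# Serialize the last <h2> (the "References" heading) and cut the raw HTML just before it,
# so the heading itself is excluded. Assumes str(tag) appears verbatim in the html string.
last_h2_tag = str(soup.find_all("h2")[-1])
trimmed_html = html[:html.rfind(last_h2_tag)]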
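
If the serialized tag might not match the original string exactly (a parser can normalize attributes or entities), a variant is to search the raw text for the opening tag directly. The sketch below makes that assumption explicit: slice_before_nth_h2 is a hypothetical helper that treats the HTML as plain text and counts `<h2` openings with a regular expression, so it will also count an `<h2` that appears inside a comment or script.

```python
import re


def slice_before_nth_h2(html, n):
    """Return html up to, but not including, the n-th <h2> opening tag (1-based).

    Hypothetical helper for illustration; works on the raw string only.
    """
    # Record the start offset of every "<h2" opening tag.
    starts = [m.start() for m in re.finditer(r"<h2\b", html, flags=re.IGNORECASE)]
    if len(starts) < n:
        return html  # fewer than n headings: nothing to cut
    return html[:starts[n - 1]]


html = (
    "<h1>Name of Article</h1>\n<p>First</p>\n"
    "<h2>Subheading</h2>\n<p>More</p>\n"
    "<h2>References</h2>\n<p>Unwanted</p>\n"
)
print(slice_before_nth_h2(html, 2))
```

Here n=2 cuts at the second h2, which matches the question; the result is a plain string, so parse it again with BeautifulSoup if you need a tree.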