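How do I get "subsoups" and concatenate/join them?

Time: 2015-12-30 13:34:24

Tags: python html beautifulsoup html-parsing

I have an HTML document that I need to process, and I am using BeautifulSoup for that. Now I would like to retrieve some "subsoups" from that document and join them into a single soup, so that I can later pass it as an argument to a function that expects a soup object.

In case that is not clear, here is an example: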

from bs4 import BeautifulSoup

my_document = """
<html>
<body>

<h1>Some Heading</h1>

<div id="first">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<p>A paragraph.</p>
</div>

<div id="second">
<p>A paragraph.</p>
<p>A paragraph.</p>
</div>

<div id="third">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<a href="yet_another_doc.html">A link</a>
</div>

<p id="loner">A paragraph.</p>

</body>
</html>
"""

soup = BeautifulSoup(my_document, "html.parser")

# find the needed parts
first = soup.find("div", {"id": "first"})
third = soup.find("div", {"id": "third"})
loner = soup.find("p", {"id": "loner"})
subsoups = [first, third, loner]

# create a new (sub)soup
resulting_soup = do_some_magic(subsoups)

# use it in a function that expects a soup object and calls its methods
function_expecting_a_soup(resulting_soup)

The goal is for resulting_soup to be an object with the following content that behaves like a soup:

<div id="first">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<p>A paragraph.</p>
</div>

<div id="third">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<a href="yet_another_doc.html">A link</a>
</div>

<p id="loner">A paragraph.</p>

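Is there a convenient way to do this? If there is a better way than find() to retrieve the "subsoups", I can use that instead. Thanks.

Update

Wondercricket suggested a solution that concatenates the strings of the found tags and then parses them again into a new BeautifulSoup object. While that is one possible way to solve the problem, the re-parsing may take longer than I would like, especially when I want to retrieve most of the tags and have many such documents to process. find() returns a bs4.element.Tag. Is there a way to join several Tags into one soup without converting the Tags to strings and parsing those strings?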

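2 answers:

Answer 0 (score: 5)

SoupStrainer does exactly what you are asking for, and as a bonus you get a performance boost, because it parses only the parts you want instead of building the complete document tree: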

from bs4 import BeautifulSoup, SoupStrainer

parse_only = SoupStrainer(id=["first", "third", "loner"])
soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)

Now the soup object contains only the desired elements:

<div id="first">
 <p>
  A paragraph.
 </p>
 <a href="another_doc.html">
  A link
 </a>
 <p>
  A paragraph.
 </p>
</div>
<div id="third">
 <p>
  A paragraph.
 </p>
 <a href="another_doc.html">
  A link
 </a>
 <a href="yet_another_doc.html">
  A link
 </a>
</div>
<p id="loner">
 A paragraph.
</p>
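
To tie this back to the question, here is a minimal sketch of passing the strained soup to a function that only relies on ordinary soup methods. The sample markup is a shortened copy of the question's document, and `count_links` is a hypothetical stand-in for `function_expecting_a_soup`; neither comes from the original answer.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Shortened version of the question's document (an assumption for this sketch).
sample_document = """
<html><body>
<h1>Some Heading</h1>
<div id="first"><p>A paragraph.</p><a href="another_doc.html">A link</a></div>
<div id="second"><p>A paragraph.</p></div>
<div id="third"><a href="another_doc.html">A link</a><a href="yet_another_doc.html">A link</a></div>
<p id="loner">A paragraph.</p>
</body></html>
"""


def count_links(soup):
    """Hypothetical consumer that only calls regular soup methods."""
    return len(soup.find_all("a"))


# Keep only the top-level elements whose id is in the list.
wanted = SoupStrainer(id=["first", "third", "loner"])
strained_soup = BeautifulSoup(sample_document, "html.parser", parse_only=wanted)

kept_ids = [tag["id"] for tag in strained_soup.find_all(id=True)]
link_count = count_links(strained_soup)
print(kept_ids, link_count)
```

The result is still a BeautifulSoup object, so methods such as find() and find_all() can be called on it as on any other soup.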
  

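Is it also possible to specify not only IDs but also tags? For example, what if I want to keep all paragraphs with class="someclass" but not divs with the same class?

In that case, you can pass a search function to SoupStrainer that combines multiple conditions: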

from bs4 import BeautifulSoup, SoupStrainer

my_document = """
<html>
<body>

    <h1>Some Heading</h1>

    <div id="first">
    <p>A paragraph.</p>
    <a href="another_doc.html">A link</a>
    <p>A paragraph.</p>
    </div>

    <div id="second">
    <p>A paragraph.</p>
    <p>A paragraph.</p>
    </div>

    <div id="third">
    <p>A paragraph.</p>
    <a href="another_doc.html">A link</a>
    <a href="yet_another_doc.html">A link</a>
    </div>

    <p id="loner">A paragraph.</p>

    <p class="myclass">test</p>
</body>
</html>
"""

def search(tag, attrs):
    # Keep <p> tags that carry the class "myclass".
    if tag == "p" and "myclass" in attrs.get("class", []):
        return tag

    # Keep any tag whose id is one of the wanted ids.
    if attrs.get("id") in ["first", "third", "loner"]:
        return tag


parse_only = SoupStrainer(search)

soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)

print(soup.prettify())
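
As an alternative sketch, assuming the whole document has already been parsed, the same two conditions can be written as a function passed to find_all(), which is called with each Tag object. This is a swapped-in technique, not the SoupStrainer approach above, and it does not save any parsing time. The extra `<div class="myclass">` and the names `is_wanted` and `WANTED_IDS` are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Small sample document; the <div class="myclass"> is added for illustration.
sample_document = """
<html><body>
<div id="first"><p>A paragraph.</p></div>
<div id="second"><p>A paragraph.</p></div>
<div id="third"><a href="another_doc.html">A link</a></div>
<p id="loner">A paragraph.</p>
<p class="myclass">test</p>
<div class="myclass">not wanted</div>
</body></html>
"""

WANTED_IDS = {"first", "third", "loner"}


def is_wanted(tag):
    """Return True for <p class="myclass"> or for any tag with a wanted id."""
    if tag.name == "p" and "myclass" in tag.get("class", []):
        return True
    return tag.get("id") in WANTED_IDS


soup = BeautifulSoup(sample_document, "html.parser")
matches = soup.find_all(is_wanted)
summary = [(tag.name, tag.get("id")) for tag in matches]
print(summary)
```

Unlike parse_only, find_all() also reports matches nested inside other matches, so the two approaches can differ on documents where wanted elements contain each other.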

Answer 1 (score: 3)

You can use findAll (find_all in current bs4 naming), passing it a list of the ids of the elements you want.

import bs4

soup = bs4.BeautifulSoup(my_document, 'html.parser')

# Edit: no regex is needed; a list of ids can be passed directly.
sub = soup.findAll(attrs={'id': ['first', 'third', 'loner']})

# Edit: with 'html.parser', BeautifulSoup does not wrap the fragment in <html> and <body> tags.
sub = bs4.BeautifulSoup('\n\n'.join(str(s) for s in sub), 'html.parser')

print(sub)

>>> <div id="first">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<p>A paragraph.</p>
</div>
<div id="third">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<a href="yet_another_doc.html">A link</a>
</div>
<p id="loner">A paragraph.</p>
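
The update in the question asks whether the string round trip can be avoided. One possible sketch, assuming a bs4 version in which copy.copy() works on Tag objects, is to append copies of the found tags to an empty soup. The sample markup is shortened from the question, and the variable names are illustrative; this is not part of the original answer.

```python
import copy

from bs4 import BeautifulSoup

# Shortened version of the question's document (an assumption for this sketch).
sample_document = """
<html><body>
<div id="first"><p>A paragraph.</p><a href="another_doc.html">A link</a></div>
<div id="second"><p>A paragraph.</p></div>
<div id="third"><a href="another_doc.html">A link</a><a href="yet_another_doc.html">A link</a></div>
<p id="loner">A paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(sample_document, "html.parser")
found = soup.find_all(attrs={"id": ["first", "third", "loner"]})

# Start from an empty soup and attach a detached copy of each found tag,
# so the original tree is left untouched and nothing is re-parsed.
combined = BeautifulSoup("", "html.parser")
for tag in found:
    combined.append(copy.copy(tag))

combined_ids = [tag["id"] for tag in combined.find_all(id=True)]
print(combined_ids)
```

The copies are independent of the original tree, so changes made to `combined` should not affect `soup`. Whether this is actually faster than re-parsing the joined strings would need to be measured on real documents.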