Question

I want to parse part of an HTML page, for example:

my_string = """
<p>Some text. Some text. Some text. Some text. Some text. Some text.
   <a href="#">Link1</a>
   <a href="#">Link2</a>
</p>
<img src="image.png" />
<p>One more paragraph</p>
"""

I pass this string to BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(my_string)
# add rel="nofollow" to <a> tags
# return comment to the template

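But when parsing, BeautifulSoup adds <html>, <head> and <body> tags (if the lxml or html5lib parser is used), and I don't need those tags in my code. The only way I have found so far to avoid this is to use html.parser.

I would like to know whether there is a way to get rid of the redundant tags while still using lxml, the fastest parser.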

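Update

My question was originally asked incorrectly. I have now removed the <div> wrapper from my example, because ordinary users do not use this tag. For that reason, we cannot use the .extract() method to get rid of the <html>, <head> and <body> tags.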

Answer 1

Use

soup.body.renderContents()
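
To make this concrete, here is a minimal sketch of the same idea, assuming bs4 (and lxml, for the default parser argument) is installed. It uses decode_contents(), which returns the inner markup as a str, in place of the older renderContents() spelling. The function name add_nofollow and its parser parameter are made up for illustration and are not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def add_nofollow(fragment, parser="lxml"):
    """Add rel="nofollow" to every <a> and return the fragment's markup
    without any <html>/<head>/<body> wrapper.

    add_nofollow and its parameters are hypothetical names for this sketch.
    """
    soup = BeautifulSoup(fragment, parser)
    for link in soup.find_all("a"):
        link["rel"] = "nofollow"
    # lxml and html5lib wrap the fragment in <html><body>...</body></html>.
    # html.parser does not, so soup.body is None there and the soup itself
    # is the root of the fragment.
    root = soup.body if soup.body is not None else soup
    # decode_contents() renders only the children of root, as a str.
    return root.decode_contents()
```

Calling add_nofollow(my_string) should then give back the paragraphs and the image with the links modified, but without the wrapper tags. Falling back to the soup itself when soup.body is None keeps the function usable with html.parser as well.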

Answer 2

lxml will always add those tags, but you can use Tag.extract() to pull your <div> tag out from inside them:

comment = soup.body.div.extract()

Answer 3

I was able to solve the problem using the .contents attribute:

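# body_to_string is a hypothetical wrapper name; the original snippet was a bare function body
def body_to_string(soup):
    try:
        # soup.body exists when the parser (lxml, html5lib) added the wrapper tags
        children = soup.body.contents
        string = ''
        for child in children:
            string += str(child)
        return string
    except AttributeError:
        # soup.body is None (e.g. with html.parser), so render the soup as-is
        return str(soup)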

I thought ''.join(soup.body.contents) would be a neater way to do the string conversion, but it does not work; I get

TypeError: sequence item 0: expected string, Tag found
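
The error comes from str.join() accepting only strings, while .contents holds Tag and NavigableString objects. Converting each child with str() first makes the one-liner work. A minimal sketch, assuming bs4 is installed; inner_html is a hypothetical helper name, not a BeautifulSoup API:

```python
from bs4 import BeautifulSoup


def inner_html(soup):
    """Join the children of <body> (or of the soup itself) into one string.

    inner_html is a hypothetical helper name used only for this sketch.
    """
    # Use <body> when the parser added one; otherwise use the soup directly.
    root = soup.body if soup.body is not None else soup
    # str() turns each Tag or NavigableString into its markup, so join() accepts it.
    return "".join(str(child) for child in root.contents)
```

For example, inner_html(BeautifulSoup(my_string, "lxml")) should return the fragment without the <html> and <body> wrappers, and the generator expression replaces the explicit loop and string concatenation.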

BeautifulSoup: parse only part of a page
