Question

I want to get only the content inside the <p> tag and remove the extra div tags.
My code is:

from bs4 import BeautifulSoup

page = """
<p style="text-align: justify">content that I want
    <div ><!-- /316485075/agk_116000_pos_3_sidebar_mobile -->
        <div id="agk_116000_pos_3_sidebar_mobile">
            <script>
                script code
            </script>
        </div>
        <div class="nopadding clearfix hidden-print">
            <div align="center" class="col-md-12">
            <!-- /316485075/agk_116000_pos_4_conteudo_desktop -->
                <div id="agk_116000_pos_4_conteudo_desktop" style="height:90px; width:728px;">
                    <script>
                        script code
                    </script>
                </div>
            </div>
        </div>
    </div>
</p>
"""
soup = BeautifulSoup(page, 'html.parser')
p = soup.find_all('p', {'style' : 'text-align: justify'})

I only want to get the string <p>content that I want</p>, with all the divs removed.

Answer 1

You can use the replace_with() function to remove a tag together with its contents.

soup = BeautifulSoup(html, 'html.parser')   # html is the HTML string from the question
soup.find('div').replace_with('')
print(soup)

Output:

<p style="text-align: justify">content that I want

</p>

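Note: I use soup.find('div') here because all the unwanted tags are inside the first div tag, so removing that tag removes all the others with it. However, if your HTML is not structured this way and you want to remove every tag other than p, you have to use this: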

for tag in soup.find_all():
    if tag.name == 'p':
        continue
    tag.replace_with('')

This is equivalent to:

[tag.replace_with('') for tag in soup.find_all(lambda t: t.name != 'p')]

If you only want the text "content that I want", you can use:

# contents[0] is the first text node of <p>; strip() drops its trailing whitespace
print(soup.find('p').contents[0].strip())
# content that I want
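
As a complement to the approach above, here is a minimal sketch that uses decompose() instead of replace_with(''). The markup is a shortened stand-in for the page in the question (the id "ad_1" is made up for illustration), and it assumes html.parser, which keeps the <div> nested inside the <p>; other parsers such as lxml or html5lib may restructure this invalid nesting.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the markup from the question.
page = """
<p style="text-align: justify">content that I want
    <div><!-- ad slot -->
        <div id="ad_1"><script>script code</script></div>
    </div>
</p>
"""

soup = BeautifulSoup(page, "html.parser")
p = soup.find("p", {"style": "text-align: justify"})

# Only look at direct children: removing an outer <div> already
# removes everything nested inside it.
for div in p.find_all("div", recursive=False):
    div.decompose()  # destroys the tag and all of its contents

# get_text(strip=True) strips each remaining text node and joins them.
text = p.get_text(strip=True)
print(text)  # content that I want
```

After the loop, str(p) still includes the surrounding <p> tag with its style attribute, so the same code covers both the "tag plus text" and the "text only" cases.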

Answer 2

Capture group 2 contains your content:

<(.*?)(?:\s.+?>)(.*?)</\1[>]?

See https://regex101.com/r/m8DQic/1
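
A rough sketch of how this pattern could be applied with Python's re module. The input here is an assumption: a single-line <p> element rather than the multi-line page from the question. By default `.` does not match newlines in Python, and with re.DOTALL group 2 would run up to the closing </p> and so also capture the nested div markup, so this is best treated as an illustration of the capture groups rather than a general HTML solution.

```python
import re

# Pattern from the answer, unchanged.
pattern = re.compile(r'<(.*?)(?:\s.+?>)(.*?)</\1[>]?')

# Assumed input: the <p> element on a single line, with no nested tags.
html = '<p style="text-align: justify">content that I want</p>'

match = pattern.search(html)
if match:
    tag_name = match.group(1)  # group 1: the tag name
    content = match.group(2)   # group 2: the text between the tags
    print(tag_name)  # p
    print(content)   # content that I want
```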

How to remove extra tags from Beautiful Soup results
