Question

I want to use BeautifulSoup to remove a tag with a specific ID from HTML, for example <div id=needDelete>...</div>. Below is my code, but it does not seem to work correctly:

import os, re
from bs4 import BeautifulSoup

cwd = os.getcwd()
print ('Now you are at this directory: \n' + cwd)

# find files that have an extension with HTML
Files = os.listdir(cwd)
print Files

def func(file):
    for file in os.listdir(cwd):
        if file.endswith('.html'):
            print ('HTML files are \n' + file)
            f = open(file, "r+")
            soup = BeautifulSoup(f, 'html.parser')
                matches  = str(soup.find_all("div", id="jp-post-flair"))
                #The soup.find_all part should be correct as I tested it to             
                #print the matches and the result matches the texts I want to delete.
                f.write(f.read().replace(matches,''))
                #maybe the above line isn't correct
            f.close()
func(file)

Could you help me check which part of the code is wrong and how I should fix it? Thanks a lot!

Answer 1

You can use the .decompose() method to remove the element/tag:

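with open(file, "r+") as f:
    soup = BeautifulSoup(f, 'html.parser')
    # Remove every matching <div>, together with its contents, from the tree.
    for element in soup.find_all("div", id="jp-post-flair"):
        element.decompose()

    # Parsing leaves the file position at the end, so rewind before writing
    # and truncate in case the new content is shorter than the old.
    f.seek(0)
    f.write(str(soup))
    f.truncate()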
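The same idea can be sketched as a pair of small helpers that read each file, remove the matching elements, and write the result back in a separate step, which sidesteps file-position problems with a single "r+" handle. The function names remove_div_by_id and clean_html_files are hypothetical, and the sketch assumes the files are UTF-8 encoded:

```python
from pathlib import Path

from bs4 import BeautifulSoup


def remove_div_by_id(html, element_id):
    """Return the HTML with every <div id=element_id> removed."""
    soup = BeautifulSoup(html, "html.parser")
    for element in soup.find_all("div", id=element_id):
        element.decompose()  # drops the tag and everything inside it
    return str(soup)


def clean_html_files(directory, element_id):
    """Rewrite each .html file in directory without the matching divs."""
    for path in Path(directory).glob("*.html"):
        original = path.read_text(encoding="utf-8")  # assumed encoding
        path.write_text(remove_div_by_id(original, element_id), encoding="utf-8")
```

Calling clean_html_files(".", "jp-post-flair") would process the current directory. Since this overwrites files in place, it may be worth trying it on a copy first.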

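It is also worth mentioning that you can use the .find() method instead, because an id attribute should be unique within a document (so in most cases there will be only one matching element):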

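with open(file, "r+") as f:
    soup = BeautifulSoup(f, 'html.parser')
    # find() returns the first match, or None if there is no such element.
    element = soup.find("div", id="jp-post-flair")
    if element:
        element.decompose()

    # Rewind and truncate so the file is overwritten rather than appended to.
    f.seek(0)
    f.write(str(soup))
    f.truncate()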

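As an alternative, based on the comments below:

If you only want to parse and modify part of the document, BeautifulSoup has a SoupStrainer class that lets you selectively parse parts of the document.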
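As a rough illustration of SoupStrainer, the sketch below parses only the div with the target id. The sample markup and the helper name parse_only_flair are made up for this example. Note that the resulting soup contains only the matching part, so this suits inspecting or extracting that element rather than writing the whole document back without it:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample document for illustration only.
SAMPLE_HTML = (
    '<p>intro</p>'
    '<div id="jp-post-flair"><a href="#">share</a></div>'
    '<p>outro</p>'
)


def parse_only_flair(html):
    """Parse just the <div id="jp-post-flair"> part of the document."""
    only_flair = SoupStrainer("div", id="jp-post-flair")
    return BeautifulSoup(html, "html.parser", parse_only=only_flair)


partial = parse_only_flair(SAMPLE_HTML)
print(partial)  # the <p> elements are never parsed into the tree
```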
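You mentioned that the indentation and formatting of the HTML file are being changed. Rather than converting the soup object directly to a string, take a look at the output formatting section of the documentation.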

Depending on the desired output, here are a few possible options:
- soup.prettify(formatter="minimal")
- soup.prettify(formatter="html")
- soup.prettify(formatter=None)
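To see how the three formatters differ, here is a small sketch on a made-up snippet. It uses decode() on a tag, which accepts the same formatter argument as prettify() but does not re-indent the markup. The formatter only controls how characters such as & and é are escaped, while prettify() additionally puts tags on their own lines with its own indentation:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet containing an ampersand and a non-ASCII character.
soup = BeautifulSoup("<p>caf\u00e9 &amp; bar</p>", "html.parser")

minimal_out = soup.p.decode(formatter="minimal")  # escapes only what is needed for valid markup
html_out = soup.p.decode(formatter="html")        # also uses named HTML entities where possible
none_out = soup.p.decode(formatter=None)          # no escaping at all; may yield invalid HTML

for label, text in [("minimal", minimal_out), ("html", html_out), ("None", none_out)]:
    print(label, "->", text)
```

If preserving the original layout matters more than tidy output, writing str(soup) or the result of decode() is likely to stay closer to the source than prettify().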
