I would like to use BeautifulSoup to delete a tag with a specific ID from HTML. For example, delete <div id=needDelete>...</div>
Below is my code, but it does not seem to work:
import os, re
from bs4 import BeautifulSoup
cwd = os.getcwd()
print ('Now you are at this directory: \n' + cwd)
# find files that have an extension with HTML
Files = os.listdir(cwd)
print Files
def func(file):
    for file in os.listdir(cwd):
        if file.endswith('.html'):
            print ('HTML files are \n' + file)
            f = open(file, "r+")
            soup = BeautifulSoup(f, 'html.parser')
            matches = str(soup.find_all("div", id="jp-post-flair"))
            #The soup.find_all part should be correct as I tested it to
            #print the matches and the result matches the texts I want to delete.
            f.write(f.read().replace(matches,''))
            #maybe the above line isn't correct
            f.close()
func(file)
Could you help me check which part of the code is wrong and how I should fix it? Thanks a lot!!
Answer 0 (score: 1)
You can use the .decompose() method to remove the element/tag:
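with open(file, "r+") as f:
    soup = BeautifulSoup(f, 'html.parser')
    elements = soup.find_all("div", id="jp-post-flair")
    for element in elements:
        element.decompose()
    # Parsing leaves the file position at the end, so rewind first;
    # otherwise the new HTML is appended after the old content.
    f.seek(0)
    f.write(str(soup))
    # Drop any leftover bytes, since the new content is shorter.
    f.truncate()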
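To see what .decompose() does in isolation, here is a minimal in-memory sketch. The html_doc string is made-up sample markup, not taken from the asker's files; only the id jp-post-flair comes from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html_doc = (
    '<html><body><p>Keep me</p>'
    '<div id="jp-post-flair"><span>Remove me</span></div>'
    '</body></html>'
)

soup = BeautifulSoup(html_doc, "html.parser")

# find_all() returns every matching tag; decompose() removes each one,
# together with all of its children, from the parse tree.
for element in soup.find_all("div", id="jp-post-flair"):
    element.decompose()

result = str(soup)
print(result)
```

Note that the removal happens on the parsed tree, not on the original text, which is why replacing str(soup.find_all(...)) inside the raw file content is unlikely to match anything.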
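It is also worth mentioning that you can use the .find() method instead, because an id attribute is supposed to be unique within a document (which means that in most cases there will be only one matching element):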
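with open(file, "r+") as f:
    soup = BeautifulSoup(f, 'html.parser')
    element = soup.find("div", id="jp-post-flair")
    if element:
        element.decompose()
    # Rewind and truncate so the result replaces the old content instead of being appended.
    f.seek(0)
    f.write(str(soup))
    f.truncate()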
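To apply this to every .html file in a directory, as the question attempts, one possible sketch is shown here. The helper names remove_element_by_id and clean_directory are hypothetical, and UTF-8 is an assumed file encoding; adjust both to your situation. It reads each file, closes it, and then reopens it in write mode, which sidesteps the file-position issues of "r+".

```python
import os
from bs4 import BeautifulSoup


def remove_element_by_id(path, element_id, tag="div"):
    """Remove the first <tag id=element_id> from the file at `path`.

    Returns True if the file was rewritten, False if no match was found.
    """
    # Assumption: the files are UTF-8 encoded.
    with open(path, "r", encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

    element = soup.find(tag, id=element_id)
    if element is None:
        return False  # leave the file untouched

    element.decompose()

    # Opening with "w" truncates the file, so nothing stale is left behind.
    with open(path, "w", encoding="utf-8") as f:
        f.write(str(soup))
    return True


def clean_directory(directory, element_id):
    """Run remove_element_by_id on every .html file; return the changed names."""
    changed = []
    for name in sorted(os.listdir(directory)):
        if name.endswith(".html"):
            if remove_element_by_id(os.path.join(directory, name), element_id):
                changed.append(name)
    return changed
```

A call such as clean_directory(os.getcwd(), "jp-post-flair") would then cover the asker's case. Since the files are overwritten in place, trying it on a copy of the directory first is probably wise.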
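As an alternative, based on the comments below:
BeautifulSoup has a SoupStrainer class that allows you to selectively parse only part of a document.
You mentioned that the indentation and formatting of the HTML file are being changed. Rather than converting the soup object directly to a string, you can look at the relevant output formatting section in the documentation.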
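As a rough illustration of SoupStrainer, the sketch below parses only the div with the id from the question. The html_doc string is made-up sample markup. Keep in mind that a strained soup contains only the matched parts, so this is suited to inspecting or extracting the element, not to writing back the whole page with the element removed.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample markup, used only for illustration.
html_doc = (
    '<html><body><p>Intro</p>'
    '<div id="jp-post-flair"><a href="#">Share</a></div>'
    '<p>Outro</p></body></html>'
)

# Only tags matching the strainer (and their children) are kept while parsing.
only_flair = SoupStrainer("div", id="jp-post-flair")
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_flair)

flair_div = soup.find("div", id="jp-post-flair")
print(flair_div)
```

SoupStrainer works with html.parser and lxml; the html5lib parser ignores parse_only.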
Depending on the desired output, here are a few possible options:
soup.prettify(formatter="minimal")
soup.prettify(formatter="html")
soup.prettify(formatter=None)
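The following sketch compares these options on a small made-up string (html_doc is hypothetical sample markup). It also uses soup.decode(), which accepts the same formatter argument but does not add the newlines and indentation that prettify() does, so it may be the closer fit if you want to avoid reformatting the file.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: a non-ASCII character plus escaped & and < >.
html_doc = '<div><p>caf\u00e9 &amp; tea &lt;hot&gt;</p></div>'
soup = BeautifulSoup(html_doc, "html.parser")

# str(soup) serializes without adding indentation or line breaks.
compact = str(soup)

# formatter="minimal" (the default) escapes only what is needed for valid markup.
minimal = soup.decode(formatter="minimal")

# formatter="html" also converts characters to named entities where possible.
as_entities = soup.decode(formatter="html")

# formatter=None performs no escaping at all, which can yield invalid HTML.
raw = soup.decode(formatter=None)

# prettify() puts each tag and string on its own line with indentation.
pretty = soup.prettify(formatter="minimal")

for label, text in [("compact", compact), ("minimal", minimal),
                    ("html", as_entities), ("None", raw), ("pretty", pretty)]:
    print(label, "->", repr(text))
```

Even without prettify(), the parser may normalize some details of the markup, so the output is not guaranteed to be byte-for-byte identical to the input apart from the removed element.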