I have a series of HTML files in a folder. I want to go through each HTML file and remove the duplicate <li> tags. For example, from:
<ul>
<li class="toctree-l1"><a class="reference internal" href="zone.html">zone</a></li>
<li class="toctree-l1"><a class="reference internal" href="zone.html">zone</a></li>
</ul>
to:
<ul>
<li class="toctree-l1"><a class="reference internal" href="zone.html">zone</a></li>
</ul>
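I need to open each file, compare the <li> tags between <ul> and </ul>, and remove the duplicates.
I am trying to do this with Python, but I am not sure how to parse the elements and compare them. Any suggestions on how to accomplish this?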
This is what I have so far.
Thanks @alecxe. Here is my final code:
import os
from bs4 import BeautifulSoup

directory_path = '..'
output_directory_path = '..'

# Collect regular files only (skip subdirectories).
files = [x for x in os.listdir(directory_path)
         if os.path.isfile(os.path.join(directory_path, x))]

for fname in files:
    # Only process the *_index.html files.
    if not fname.endswith("_index.html"):
        continue
    seen = set()  # <li> tags already encountered in this file
    with open(os.path.join(directory_path, fname)) as f:
        soup = BeautifulSoup(f, "html.parser")
    for li in soup.select('ul li.toctree-l1'):
        if li in seen:
            li.extract()  # remove tag if seen
        else:
            seen.add(li)
    # Write the cleaned document under the same base name.
    fout = os.path.join(output_directory_path, fname.split(".")[0] + ".html")
    with open(fout, 'w') as fp:
        fp.write(soup.prettify())
Answer 0 (score: 0)
Keep a set of the li tags seen so far, and remove an li tag if it has already been seen:
from bs4 import BeautifulSoup

# `files` is the list of HTML file paths to process
for fname in files:
    seen = set()
    with open(fname) as f:
        soup = BeautifulSoup(f, "html.parser")
    for li in soup.select('ul li.toctree-l1'):
        if li in seen:
            li.extract()  # remove tag if seen
        else:
            seen.add(li)
I leave the rest of the logic for you to implement.
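If duplicates should only be compared within a single <ul> rather than across the whole file, one possible variation is to reset the set for each list. The sketch below is not part of the original answer: the helper name `dedupe_list_items` and the `sample` string are made up for illustration, and it assumes that two <li> tags count as duplicates when their serialized markup is identical.

```python
from bs4 import BeautifulSoup


def dedupe_list_items(html):
    """Return html with repeated <li> tags removed inside each <ul>.

    Hypothetical helper: duplicates are detected by comparing the
    serialized markup of the direct <li> children of every <ul>.
    """
    soup = BeautifulSoup(html, "html.parser")
    for ul in soup.find_all("ul"):
        seen = set()  # reset per list, so separate lists may share entries
        # recursive=False restricts the search to direct children of this <ul>
        for li in ul.find_all("li", recursive=False):
            key = str(li)  # serialized markup used as the comparison key
            if key in seen:
                li.decompose()  # drop the repeated tag from the tree
            else:
                seen.add(key)
    return str(soup)


# Illustrative input modelled on the example in the question.
sample = (
    '<ul>'
    '<li class="toctree-l1"><a class="reference internal" href="zone.html">zone</a></li>'
    '<li class="toctree-l1"><a class="reference internal" href="zone.html">zone</a></li>'
    '</ul>'
)
result = dedupe_list_items(sample)
print(result)
```

Comparing strings instead of tag objects keeps the check independent of how the library defines equality for tags, at the cost of treating tags that differ only in whitespace or attribute order as distinct.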