Removing duplicate <li> tags inside a <ul>

Asked: 2015-12-21 21:17:12

Tags: python html parsing beautifulsoup

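I have a series of HTML files in a folder. I want to go through each HTML file and remove duplicate <li> tags. For example, the first list below should be reduced to the second: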

<ul>
<li class="toctree-l1"><a class="reference internal" href="zone.html">zone</a></li>
<li class="toctree-l1"><a class="reference internal" href="zone.html">zone</a></li>
</ul>

<ul>
<li class="toctree-l1"><a class="reference internal" href="zone.html">zone</a></li>
</ul>

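I need to open each file, compare the <li> elements between <ul> and </ul>, and remove the duplicates. I am trying to do this in Python, but I am not sure how to parse the elements and compare them. Any suggestions on how to accomplish this?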

This is what I have so far.

Thanks, @alecxe. Here is my final code:

import os
from bs4 import BeautifulSoup

directory_path = '..'
output_directory_path = '..'

for fname in os.listdir(directory_path):
    src = os.path.join(directory_path, fname)
    # Only process regular files whose names end in "_index.html".
    if not (os.path.isfile(src) and fname.endswith("_index.html")):
        continue

    # Read bytes so BeautifulSoup can detect the file's encoding.
    with open(src, "rb") as f:
        soup = BeautifulSoup(f, "html.parser")

    seen = set()  # <li> tags already encountered in this file
    for li in soup.select('ul li.toctree-l1'):
        if li in seen:
            li.extract()  # remove tag if seen
        else:
            seen.add(li)

    # Write the result as bytes, re-encoded with the detected encoding.
    base = os.path.splitext(fname)[0]
    dst = os.path.join(output_directory_path, base + ".html")
    encoding = soup.original_encoding or "utf-8"
    with open(dst, "wb") as fp:
        fp.write(soup.prettify(encoding))
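
One detail in the code above is worth spelling out: prettify() returns text when called without arguments and encoded bytes when it is given an encoding, so the mode of the output file has to match. A small sketch, assuming Python 3 and the built-in html.parser backend; the sample markup is made up for illustration:

```python
from bs4 import BeautifulSoup

# Made-up sample markup, passed as bytes the way open(path, "rb") would supply it.
raw = b'<ul><li class="toctree-l1"><a href="zone.html">zone</a></li></ul>'
soup = BeautifulSoup(raw, "html.parser")

as_text = soup.prettify()          # no encoding given: returns str
as_bytes = soup.prettify("utf-8")  # encoding given: returns bytes

print(type(as_text).__name__, type(as_bytes).__name__)  # str bytes
```

So as_text belongs in a file opened with "w" (ideally with an explicit encoding), and as_bytes in one opened with "wb".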

1 answer:

Answer 0 (score: 0)

Keep a set of the li tags seen so far, and remove an li tag if it has already been seen. That is:

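from bs4 import BeautifulSoup

# `files` is the list of HTML file paths to process.
for fname in files:
    seen = set()  # tags already encountered in this file
    with open(fname) as f:
        soup = BeautifulSoup(f, "html.parser")
        for li in soup.select('ul li.toctree-l1'):
            if li in seen:
                li.extract()  # remove tag if seen
            else:
                seen.add(li)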
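
The comparison works because Beautiful Soup treats two Tag objects as equal when they represent the same markup, even though they sit at different positions in the tree. A quick check, assuming Python 3 and the built-in html.parser backend:

```python
from bs4 import BeautifulSoup

html = """<ul>
<li class="toctree-l1"><a class="reference internal" href="zone.html">zone</a></li>
<li class="toctree-l1"><a class="reference internal" href="zone.html">zone</a></li>
</ul>"""

first, second = BeautifulSoup(html, "html.parser").select("ul li.toctree-l1")

print(first == second)  # True: same name, attributes and contents
print(first is second)  # False: two distinct objects in the tree
```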

The rest of the logic is left for you to implement.
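
As one way to fill in that remaining logic, here is a minimal sketch, assuming Python 3 and the built-in html.parser backend. The helper name dedupe_list_items and its css_class parameter are hypothetical. It is also a variation on the code above rather than the same thing: it keys the set on str(li) instead of the tag object, and it resets the set for every <ul>, so an item is only dropped when it repeats within the same list.

```python
from bs4 import BeautifulSoup


def dedupe_list_items(html, css_class="toctree-l1"):
    """Return html with repeated <li> items removed from each <ul>.

    Hypothetical helper: two items count as duplicates when their
    serialized markup is identical.
    """
    soup = BeautifulSoup(html, "html.parser")
    for ul in soup.find_all("ul"):
        seen = set()  # serialized items already seen in this <ul>
        # Only direct children, so nested lists are handled on their own.
        for li in ul.find_all("li", class_=css_class, recursive=False):
            key = str(li)
            if key in seen:
                li.extract()  # detach the repeated item from the tree
            else:
                seen.add(key)
    return str(soup)


sample = """<ul>
<li class="toctree-l1"><a class="reference internal" href="zone.html">zone</a></li>
<li class="toctree-l1"><a class="reference internal" href="zone.html">zone</a></li>
</ul>"""

cleaned = dedupe_list_items(sample)
remaining = BeautifulSoup(cleaned, "html.parser").select("ul li.toctree-l1")
print(len(remaining))  # 1
```

To apply it to files, read each one, pass its contents to the helper, and write the returned string back out.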