How to delete all data below a specific line in an HTML file using Python 3

Time: 2017-04-16 17:46:05

Tags: html python-3.x

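I have an HTML file containing more than 300 lines of data. I want to delete everything below a specific line. For example, I want to delete everything below the following line. How can I do this?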

<pre>
Page 5

If possible, please keep the closing tags, which are the last line of the HTML file.

<hr></body></html>

I wrote the following code, but it only deletes that specific line (Page 5). I want to delete all lines below "Page 5". How can I do that?

f = open("4105.html","r")
lines = f.readlines()
f.close()
f = open("4105-modified.html","w")
for line in lines:
  if line!='''Page 5'''+"\n":
    f.write(line)

1 answer:

Answer 0 (score: 2)

Stop writing lines once Page 5 is found:

with open('4105.html') as inf, open('4105-modified.html','w') as outf:
    for line in inf:
        outf.write(line)
        if line == 'Page 5\n':
            break

    # if you want the last tags to remain
    outf.write('<hr></body></html>')
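The loop above compares each line to 'Page 5\n' exactly and hardcodes the closing tags. As a minimal sketch of a more tolerant variant, assuming the marker sits on a line of its own and the closing tags are the file's last line, the same idea can be written as a function over a list of lines. The name truncate_after and its keep_last parameter are hypothetical, introduced only for illustration:

```python
def truncate_after(lines, marker, keep_last=True):
    """Keep lines up to and including the first line whose stripped text
    equals `marker`; optionally re-append the original last line."""
    lines = list(lines)
    for i, line in enumerate(lines):
        # strip() tolerates trailing spaces and '\r\n' line endings
        if line.strip() == marker:
            kept = lines[:i + 1]
            # Re-append the last line (assumed to hold the closing tags)
            # only if it actually comes after the marker line.
            if keep_last and i < len(lines) - 1:
                kept.append(lines[-1])
            return kept
    return lines  # marker not found: leave the content unchanged
```

To apply it to the files from the question, read the input with readlines(), pass the list through truncate_after(lines, 'Page 5'), and write the result with writelines().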

I would consider using an HTML parser such as BeautifulSoup.

Edit per comment (untested):

with open('4105.html') as inf, open('4105-modified.html', 'w') as outf:
    lines = inf.readlines()
    try:
        idx = lines.index('Page 5\n')  # raises ValueError if the line is absent
    except ValueError:
        idx = None
    if idx is not None:  # found it
        del lines[idx + 1:-1]  # delete everything after it except the last line (trailing tags)
        if idx > 0:
            del lines[idx - 1]  # delete the line before
    outf.write(''.join(lines))
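Note that list.index raises ValueError when the value is missing rather than returning -1, so the "not found" case needs a try/except. A hedged sketch of the same edit as a small function that can be checked without touching any files (drop_around_marker is a hypothetical name, and it assumes the last line holds the closing tags):

```python
def drop_around_marker(lines, marker):
    """Remove the line just before `marker` and every line between
    `marker` and the final line; return the lines unchanged if absent."""
    lines = list(lines)
    try:
        idx = lines.index(marker)  # exact match, including the newline
    except ValueError:
        return lines  # marker not present
    # Delete everything after the marker except the last line.
    del lines[idx + 1:-1]
    # Delete the line before the marker, if there is one.
    if idx > 0:
        del lines[idx - 1]
    return lines
```

Deleting the tail first keeps idx valid for the second deletion, and the idx > 0 guard avoids removing the last line by accident through a negative index when the marker is the first line.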