Question

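I am trying to open all the HTML files in a directory (so far so good), find the footer element in each file (also fine), delete the footer (no dice), and then write the result back to an HTML file without the footer (also no dice).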

Here is what I have so far. Calling replaceWith on the result of findAll gives me the following error:

    AttributeError: 'list' object has no attribute 'replaceWith'

I have also tried

    for files in filenames:
        soup = BeautifulSoup (open(files, "w+"))
        bottom = soup.findAll("footer")
        decompose(bottom)

This gives me the following error:

    NameError: global name 'decompose' is not defined

I would be happy with either a BeautifulSoup 3 or a bs4 solution, especially if there is a way to save each HTML file as a separate file with its footer removed.

Answer 1

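You need to change it to:

for files in filenames:
    # Open for reading; "w+" would truncate the file before it is parsed
    soup = BeautifulSoup(open(files))
    # findAll returns a list of tags, so call decompose() on each one
    bottom = soup.findAll("footer")
    for single_footer in bottom:
        single_footer.decompose()
    # Then save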
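
To see the per-element call in isolation, here is a minimal sketch that runs on an in-memory string instead of files. The sample HTML is made up for illustration, and "html.parser" is used only so that the sketch does not depend on lxml; neither comes from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html = (
    "<html><body><p>Body text</p>"
    "<footer>First</footer><footer>Second</footer>"
    "</body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# find_all (spelled findAll in older code) returns a list-like collection of tags.
footers = soup.find_all("footer")

# decompose() is a method of each tag, not a free function and not a list method,
# so it has to be called once per element.
for footer in footers:
    footer.decompose()

result = str(soup)
print(result)
```

Both errors in the question come from the same cause: the list itself has no replaceWith or decompose, and there is no global decompose function to pass it to.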

Here is how to use os.walk to traverse a directory and remove the footer from every file:

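from bs4 import BeautifulSoup as bs
import os

input_dir = r"C:\Users\User\Desktop\test"

for root, dirs, files in os.walk(input_dir):
    for single_file in files:
        # 'r+' opens for reading and writing without truncating first.
        # encoding='utf-8' assumes Python 3 and UTF-8 files.
        with open(os.path.join(root, single_file), 'r+', encoding='utf-8') as inpt:
            soup = bs(inpt.read(), 'lxml')
            footers = soup.findAll('footer')
            if footers:
                for footer in footers:
                    footer.decompose()
                inpt.seek(0)  # rewind to the start of the file
                inpt.write(str(soup))  # a text-mode file takes str, not bytes
                inpt.truncate()  # cut off what is left of the longer old content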

Answer 2

To remove a tag in BeautifulSoup, you should use decompose. In your case that would be:

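import codecs
from bs4 import BeautifulSoup

for files in filenames:
    soup = BeautifulSoup(open(files))
    # soup.footer is only the first <footer>, and is None if there is none
    if soup.footer is not None:
        soup.footer.decompose()
    # Note: every iteration writes to the same "abc1.html"
    f = codecs.open("abc1.html", mode="w", encoding="utf-8")
    f.write(soup.prettify())
    f.close()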

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#decompose
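
The question also asks for each file to be saved separately with its footer removed. One possible way to do that is sketched below. The function name strip_footers, the src_dir/dst_dir parameters, the "*.html" filter, the UTF-8 encoding and the "html.parser" choice are all assumptions made for this illustration, not part of either answer.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def strip_footers(src_dir, dst_dir):
    """Write a footer-free copy of every .html file in src_dir into dst_dir.

    Hypothetical helper: assumes the files are UTF-8 and sit directly in src_dir.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.glob("*.html")):
        # Read the whole file first, so nothing is truncated before parsing.
        soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
        # Remove every <footer>, not just the first one.
        for footer in soup.find_all("footer"):
            footer.decompose()
        # Keep the original file name so each input gets its own output file.
        out_path = dst / path.name
        out_path.write_text(str(soup), encoding="utf-8")
        written.append(out_path)
    return written
```

Called as strip_footers("pages", "pages_clean") (both directory names are placeholders), this leaves the originals untouched, which makes it easy to compare the output before replacing anything.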

BeautifulSoup: strip an HTML element from files in a directory and write the contents to a file
