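I'm new to Python and to BeautifulSoup. I have a Python program that opens a file called "example.html", runs a BeautifulSoup operation on it, then runs a Bleach operation on it, and then saves the result as the file "example-cleaned.html". So far it works for everything in "example.html".
I need to modify it so that it opens every file in the folder "posts/", runs the program on it, and then saves it as "posts-cleaned/X-cleaned.html", where X is the original filename.
Here is my code, cut down to the minimum: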
from bs4 import BeautifulSoup
import bleach
import re
text = BeautifulSoup(open("posts/example.html"))
text.encode("utf-8")
tag_black_list = ['iframe', 'script']
tag_white_list = ['p','div']
attr_white_list = {'*': ['title']}
# Step one, with BeautifulSoup: Remove tags in tag_black_list, destroy contents.
[s.decompose() for s in text(tag_black_list)]
pretty = (text.prettify())
# Step two, with Bleach: Remove tags and attributes not in whitelists, leave tag contents.
cleaned = bleach.clean(pretty, strip="TRUE", attributes=attr_white_list, tags=tag_white_list)
fout = open("posts/example-cleaned.html", "w")
fout.write(cleaned.encode("utf-8"))
fout.close()
print "Done"
Any assistance, or pointers to existing solutions, would be gratefully received!
Answer 0 (score: 4)
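You can use os.listdir() to get a list of all the files in a directory. If you want to recurse all the way down the directory tree, you'll need os.walk() instead.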
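To make the difference concrete, here is a minimal sketch of both approaches. The helper names list_files_flat and list_files_recursive are made up for illustration, and the code is written for Python 3 using only the standard library.

```python
import os


def list_files_flat(directory):
    """Return the sorted names of regular files directly inside `directory`."""
    return sorted(
        name for name in os.listdir(directory)
        # os.listdir() also returns sub-directories, so filter them out
        if os.path.isfile(os.path.join(directory, name))
    )


def list_files_recursive(directory):
    """Return the sorted paths of every file under `directory`, relative to it."""
    found = []
    # os.walk() yields one (root, dirs, files) tuple per directory it visits
    for root, _dirs, files in os.walk(directory):
        for name in files:
            full_path = os.path.join(root, name)
            found.append(os.path.relpath(full_path, directory))
    return sorted(found)
```

Both functions return names relative to the directory you pass in, so you still have to join them back onto that directory before opening the files.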
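I would move all of the code that handles a single file into a function, and then write a second function that handles the whole directory. Something like this: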
import os
from bs4 import BeautifulSoup
import bleach

def clean_dir(directory):
    # os.listdir() returns bare filenames, so join each one back onto the directory
    for filename in os.listdir(directory):
        clean_file(os.path.join(directory, filename))

def clean_file(filename):
    tag_black_list = ['iframe', 'script']
    tag_white_list = ['p','div']
    attr_white_list = {'*': ['title']}
    # Binary mode lets BeautifulSoup detect the encoding itself;
    # naming the parser avoids bs4's "no parser specified" warning
    with open(filename, 'rb') as fhandle:
        text = BeautifulSoup(fhandle, "html.parser")
    # Step one, with BeautifulSoup: Remove tags in tag_black_list, destroy contents.
    for s in text(tag_black_list):
        s.decompose()
    pretty = text.prettify()
    # Step two, with Bleach: Remove tags and attributes not in whitelists, leave tag contents.
    cleaned = bleach.clean(pretty, strip=True, attributes=attr_white_list, tags=tag_white_list)
    # this appends -cleaned to the file;
    # relies on the file having a '.'
    dot_pos = filename.rfind('.')
    cleaned_filename = '{0}-cleaned{1}'.format(filename[:dot_pos], filename[dot_pos:])
    # Write bytes, so the explicit UTF-8 encode works on both Python 2 and Python 3
    with open(cleaned_filename, 'wb') as fout:
        fout.write(cleaned.encode("utf-8"))
    print("Done")
Then you just call clean_dir('/posts'), or whatever directory you need.
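I'm appending "-cleaned" to the filenames here, but I think I like your idea of using a whole new directory better. That way you don't have to deal with conflicts if a file whose name ends in -cleaned already exists, and so on.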
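I'm also using the with statement to open the files here, because it closes them automatically, even if an exception is raised along the way.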
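As a small illustration of what with buys you, the sketch below raises an error part-way through a write and then checks that the file was closed anyway. The function name write_and_maybe_fail is hypothetical and exists only for this demonstration. Note that with does not swallow the exception; it only guarantees the cleanup.

```python
def write_and_maybe_fail(path, fail):
    """Write to `path` inside a `with` block, optionally raising part-way through.

    Returns a tuple (file_was_closed, exception_propagated).
    """
    seen = {}
    propagated = False
    try:
        with open(path, "w") as handle:
            seen["handle"] = handle
            handle.write("partial output")
            if fail:
                raise ValueError("simulated failure while processing")
    except ValueError:
        # The error still escapes the `with` block; only the cleanup is automatic
        propagated = True
    return seen["handle"].closed, propagated
```

In both the failing and the non-failing case the handle ends up closed, which is why no explicit fout.close() is needed.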
Answer 1 (score: 1)
Answering my own question, for anyone else who finds the Python documentation for os.listdir a bit unhelpful:
from bs4 import BeautifulSoup
import bleach
import os
tag_black_list = ['iframe', 'script']
tag_white_list = ['p','div']
attr_white_list = {'*': ['title']}
postlist = os.listdir("posts/")
for post in postlist:
    # HERE: you need to specify the directory again, the value of "post" is just the filename:
    with open("posts/"+post, "rb") as fin:
        text = BeautifulSoup(fin, "html.parser")
    # Step one, with BeautifulSoup: Remove tags in tag_black_list, destroy contents.
    for s in text(tag_black_list):
        s.decompose()
    pretty = text.prettify()
    # Step two, with Bleach: Remove tags and attributes not in whitelists, leave tag contents.
    cleaned = bleach.clean(pretty, strip=True, attributes=attr_white_list, tags=tag_white_list)
    # The posts-cleaned/ folder must already exist; binary mode because we write UTF-8 bytes
    fout = open("posts-cleaned/"+post, "wb")
    fout.write(cleaned.encode("utf-8"))
    fout.close()
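I cheated and made a separate folder called "posts-cleaned/", because saving the files there was easier than splitting the filename, adding "-cleaned" and joining it back together. If anyone wants to show me a good way to do that, even better.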
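For the filename-splitting part, one option is os.path.splitext from the standard library, which separates the extension without any manual searching for the dot. A minimal sketch, where cleaned_path and its suffix parameter are illustrative names and not part of the code posted here:

```python
import os


def cleaned_path(source_path, output_dir, suffix="-cleaned"):
    """Build the output path for a cleaned copy of `source_path`.

    For example, posts/example.html maps to <output_dir>/example-cleaned.html.
    """
    base_name = os.path.basename(source_path)
    # splitext() returns ("example", ".html"); the extension is "" if there is no dot
    stem, extension = os.path.splitext(base_name)
    return os.path.join(output_dir, stem + suffix + extension)
```

Inside the loop, the output filename could then be built with cleaned_path(post, "posts-cleaned"). A file without an extension simply gets the suffix appended to its name.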