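I'm new to Python and am trying to write a program that does the following:
Open all folders and subfolders under a directory path
Identify the HTML files
Load the HTML in BeautifulSoup
Find the first body tag
If the body tag is immediately followed by <Google Tag Manager>, continue
If not, add the <Google Tag Manager> code and save the file.
I can't get it to scan all the subfolders inside each folder. I also can't get the seen set to detect whether <Google Tag Manager> appears immediately after the body tag. Any help with the task above is appreciated.
My attempt at the code is below: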
import sys
import os
from os import path
from bs4 import BeautifulSoup

directory_path = '/input'
files = [x for x in os.listdir(directory_path) if path.isfile(directory_path+os.sep+x)]
for root, dirs, files in os.walk(directory_path):
    for fname in files:
        seen = set()
        a = directory_path+os.sep+fname
        if fname.endswith(".html"):
            with open(a) as f:
                soup = BeautifulSoup(f)
                for li in soup.select('body'):
                    if li in seen:
                        continue
                    else:
                        seen.add("<!-- Google Tag Manager --><noscript><iframe src='//www.googletagmanager.com/ns.html?id=GTM-54QWZ8' height='0' width='0' style='display:none;visibility:hidden'></iframe></noscript><script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start': new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0], j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src='//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);})(window,document,'script','dataLayer','GTM-54QWZ8');</script><!-- End Google Tag Manager -->\n")
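Answer 0 (score: 2)
You can use iglob, a function in Python's standard glob module (no separate install is needed). With recursive=True (Python 3.5+), iglob walks the given top-level directory and all of its subdirectories and lists every file with a given extension. Then open each HTML file, read all of its lines, and step through them manually until you find the body tag. I match on the tag rather than assuming a fixed position because some people, for example those using a framework, may have other content inside the body tag. Either way, loop until you reach the line where the body tag opens, then check the next line: if the "Google Tag Manager" text is not on that next line, write it out. Keep in mind that I wrote this assuming the Google Tag Manager code should always come directly after the body tag.
Here is the code: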
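import glob

types = ('*.html', '*.htm')
paths = []
for fType in types:
    # recursive=True lets '**' match any depth of subdirectories (Python 3.5+)
    for filename in glob.iglob('./**/' + fType, recursive=True):
        paths.append(filename)

for path in paths:
    print(path)
    with open(path, 'r') as f:
        lines = f.readlines()
    with open(path, 'w') as w:
        for i, line in enumerate(lines):
            w.write(line)
            # Match "<body" rather than "<body>" so tags with attributes also count
            if "<body" in line:
                # Guard against the body tag being on the last line of the file
                next_line = lines[i + 1] if i + 1 < len(lines) else ''
                if "<!-- Google Tag Manager -->" not in next_line:
                    w.write('<!-- Google Tag Manager --> <!-- End Google Tag Manager -->\n')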
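The line-by-line check above depends on the body tag and the snippet sitting on separate, adjacent lines. As a rough alternative, the same "insert only if missing" step can be sketched on the whole file text with the standard re module. This is only a sketch under stated assumptions: insert_after_body, BODY_OPEN and GTM_SNIPPET are hypothetical names, the snippet is the same placeholder used above, and the pattern assumes no attribute value contains a ">" character and that no "<body" text appears earlier inside a script or comment.

```python
import re

# Hypothetical placeholder; substitute the full Google Tag Manager snippet.
GTM_SNIPPET = "<!-- Google Tag Manager --> <!-- End Google Tag Manager -->"

# Opening body tag, with or without attributes, in any letter case.
BODY_OPEN = re.compile(r"<body\b[^>]*>", re.IGNORECASE)


def insert_after_body(html, snippet=GTM_SNIPPET):
    """Return html with snippet placed right after the opening body tag.

    The text is returned unchanged when there is no body tag, or when the
    opening GTM comment already follows the tag (ignoring whitespace).
    """
    match = BODY_OPEN.search(html)
    if match is None:
        return html
    rest = html[match.end():]
    if rest.lstrip().startswith("<!-- Google Tag Manager -->"):
        return html
    return html[:match.end()] + "\n" + snippet + rest
```

Running the function a second time on its own output leaves the text unchanged, so it should be safe to re-run over a directory that has already been processed.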
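Answer 1 (score: 0)
I took a stab at it; there may be some bugs:
Edited to add: I've realized that this code does not ensure that <!-- Google Tag Manager --> is the first tag after <body>. It only ensures that it is the first comment found (the search covers the whole document, not just the part after <body>). That is not what the question asked for.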
import fnmatch
import html
import os

from bs4 import BeautifulSoup, Comment


def get_soup(filename):
    with open(filename, 'r') as myfile:
        data = myfile.read()
    return BeautifulSoup(data, 'lxml')  # the 'lxml' parser needs the lxml package


def write_soup(filename, soup):
    with open(filename, 'w') as file:
        # The snippet is inserted as a plain string, so prettify() escapes its
        # angle brackets; html.unescape() (Python 3) restores them. Note that it
        # also unescapes every other entity in the document.
        output = html.unescape(soup.prettify())
        file.write(output)


def needs_insertion(soup):
    # Collect every comment in the document, in order
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    try:
        if comments[0] == ' Google Tag Manager ':
            return False  # has correct comment
        else:
            return True  # has comments, but not correct one
    except IndexError:
        return True  # has no comments


def get_html_files_in_dir(top_level_directory):
    matches = []
    # os.walk visits every subdirectory; join root so nested paths are correct
    for root, dirnames, filenames in os.walk(top_level_directory):
        for filename in fnmatch.filter(filenames, '*.html'):
            matches.append(os.path.join(root, filename))
    return matches


my_html_files_path = '/home/azrad/whateveryouneedhere'
for full_file_name in get_html_files_in_dir(my_html_files_path):
    soup = get_soup(full_file_name)
    if needs_insertion(soup):
        soup.body.insert(0, '<!-- Google Tag Manager --> <!-- End Google Tag Manager -->')
        write_soup(full_file_name, soup)
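Regarding the caveat in the edit note (the check looks at the first comment rather than at what comes right after the body tag), one possible way to tighten it is to inspect the first non-whitespace child of the body element instead. The following is a minimal sketch, not part of the original answer: first_significant_node, body_starts_with_gtm and GTM_MARKER are hypothetical names, and it uses Python's built-in "html.parser" backend so that lxml is not required.

```python
from bs4 import BeautifulSoup, Comment

GTM_MARKER = " Google Tag Manager "  # text of the opening comment


def first_significant_node(body):
    """Return the first child of body that is not whitespace-only text."""
    for node in body.children:
        if isinstance(node, Comment):
            return node
        if isinstance(node, str) and not node.strip():
            continue  # skip newlines and indentation between tags
        return node
    return None


def body_starts_with_gtm(html):
    """True only if the GTM comment is the first real node inside body."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.body is None:
        return False
    node = first_significant_node(soup.body)
    return isinstance(node, Comment) and node == GTM_MARKER
```

With a check like this, a page whose Google Tag Manager comment appears later in the body (after other elements) is reported as not starting with it, which is closer to what the question asked for.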