Question

我是python的新手并尝试执行以下操作的程序：

打开目录路径中的所有文件夹和子文件夹
识别HTML文件
在BeautifulSoup中加载HTML
找到第一个正文标记
如果身体标签后面紧跟着＆lt; Google跟踪代码管理器＆gt;然后继续
如果没有，则添加＆lt; Google跟踪代码管理器＆gt;代码并保存文件。

我无法扫描每个文件夹中的所有子文件夹。如果＆lt;我不能设置seen（） Google跟踪代码管理器＆gt;在身体标签后立即出现。任何帮助执行上述任务表示赞赏。

我的代码尝试如下：

  import sys
  import os
  from os import path
  from bs4 import BeautifulSoup

  directory_path = '/input'

  files = [x for x in os.listdir(directory_path) if path.isfile(directory_path+os.sep+x)]

  for root, dirs, files in os.walk(directory_path):
      for fname in files:
      seen = set()
      a = directory_path+os.sep+fname
      if fname.endswith(".html"):
          with open(a) as f:
              soup = BeautifulSoup(f)
              for li in soup.select('body'):
                if li in seen:
                    continue
                else:
                    seen.add("<!-- Google Tag Manager --><noscript><iframe src='//www.googletagmanager.com/ns.html?id=GTM-54QWZ8'height='0' width='0' style='display:none;visibility:hidden'></iframe></noscript><script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start': new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0], j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src='//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);})(window,document,'script','dataLayer','GTM-54QWZ8');</script><!-- End Google Tag Manager —>\n")

Answer 1

所以你可以为python安装iglob库。使用iglob，您可以递归遍历指定的主目录和子目录，并列出具有给定扩展名的所有文件。然后打开HTML文件，读取所有行，手动遍历行直到找到标记的“”，因为可能使用框架工作的一些用户可能在body标记内有其他内容。无论哪种方式，循环查找正文标记开头的行，然后检查下一行，如果指定“Google跟踪代码管理器”的文本不在下一行，则将其写出。请记住，如果您在身体标记后面始终拥有Google跟踪代码管理器代码，我会写这封信。

请记住：

如果Google跟踪代码管理器文本没有直接在正文标记之后，则此代码会添加它，因此如果Google代码管理器位于Two body标记的某个位置并且有效，则可能会破坏Google的功能标签管理员。
我正在使用Python 3.x，所以如果您使用的是Python 2，则可能需要将其转换为该版本的python。
将“Path.html”替换为变量路径，以便重写其正在查看的文件并进行修改。我输入'path.html'以便在编写脚本时能够看到输出并与原始文件进行比较。

以下是代码：

import glob

types = ('*.html', '*.htm') 
paths = []

for fType in types:
    for filename in glob.iglob('./**/' + fType, recursive=True):
        paths.append(filename)
#print(paths)

for path in paths:
    print(path)
    with open(path,'r') as f:
        lines = f.readlines()

    with open(path, 'w') as w:
        for i in range(0,len(lines)):
            w.write(lines[i])
            if "<body>" in lines[i]:
                if "<!-- Google Tag Manager -->" not in lines[i+1]:
                    w.write('<!-- Google Tag Manager --> <!-- End Google Tag Manager -->\n')

Answer 2

我接受它，可能会有一些错误：

已编辑添加：我已经意识到此代码无法确保是<body>之后的第一个标记，而是确保它是<body>之后的第一个注释。这不是问题所要求的。

import fnmatch
import os
from bs4 import BeautifulSoup, Comment
from HTMLParser import HTMLParser


def get_soup(filename):
  with open(filename, 'r') as myfile:
    data=myfile.read()
  return BeautifulSoup(data, 'lxml')


def write_soup(filename, soup):
  with open(filename, "w") as file:
    output = HTMLParser().unescape(soup.prettify())
    file.write(output)


def needs_insertion(soup):
  comments = soup.find_all(text=lambda text:isinstance(text, Comment))
  try:
    if comments[0] == ' Google Tag Manager ':
      return False  # has correct comment
    else:
      return True  # has comments, but not correct one
  except IndexError:
    return True  # has no comments


def get_html_files_in_dir(top_level_directory):
  matches = []
  for root, dirnames, filenames in os.walk(top_level_directory):
      for filename in fnmatch.filter(filenames, '*.html'):
          matches.append(os.path.join(root, filename))
  return matches


my_html_files_path = '/home/azrad/whateveryouneedhere'
for full_file_name in get_html_files_in_dir(my_html_files_path):
  soup = get_soup(full_file_name)

  if needs_insertion(soup):
    soup.body.insert(0, '<!-- Google Tag Manager --> <!-- End Google Tag Manager -->')
    write_soup(full_file_name, soup)

在Python

2 个答案: