Question

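I have a script that runs and should generate a number of new .html files in a directory. It exits with code 0, which indicates no problems, and as far as I can tell it should work. But it doesn't! :)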

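The code should iterate over a directory of .html files and, in each file, find all the text between two elements (the positions of these elements are stored in the variables start and end).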

import os

dir = os.listdir("C:/Users/folder")

files = []

for file in dir:
    if file[-5:] == '.html':
        files.insert(0, file)


for fileName in files:
    print fileName
    file = open("C:/Users/folder/" + fileName)
    content = file.read()
    file.close()

    start = content.find('<div class="title">')
    end = content.find('<div class="footer">')

    if start != -1:
        newContent = content[start:]
    if end != -1:
        newContent = content[0:end - 1]

    file = open(fileName + "_mod", 'w')
    file.write(newContent)
    file.close()

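So this should iterate over a directory, find the text between '<div class="title">' and '<div class="footer">', save that text in a variable, and write it to a new file with the same name as the original, with "_mod" appended to the end.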

However, that is not how it behaves. Instead, it keeps everything from the start of the document up to the footer div.

So I want it to convert a given HTML file from:

<head>
   <title>This is bad HTML</title>
</head>
<body>
  <h1> Remove me</h1>
  <div class="title">
    <h1> This is the good data, keep me</h1>

    <p> Keep this text </p>

  </div>
  <div class="footer">
    <h1> Remove me, I am pointless</h1>
  </div>
</body>

into just:

  <div class="title">
    <h1> This is the good data, keep me</h1>

    <p> Keep this text </p>

  </div>

But the output I currently get is:

<head>
   <title>This is bad HTML</title>
</head>
<body>
  <h1> Remove me</h1>
  <div class="title">
    <h1> This is the good data, keep me</h1>

    <p> Keep this text </p>

  </div>

What logical error am I making here?

Answer 1

Your flaw is here:

if start != -1:
    newContent = content[start:]
if end != -1:
    newContent = content[0:end - 1]

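If both start != -1 and end != -1, the second assignment overwrites the first, because it slices the original content again rather than the already trimmed newContent. As a result newContent ends up being just content[0:end - 1].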
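
To see the overwrite in isolation, here is a minimal sketch that reproduces it on a short string. The `sample` text is a made-up stand-in for one of the HTML files, not data from the question.

```python
# Hypothetical sample text standing in for one of the HTML files.
sample = 'HEAD<div class="title">KEEP</div> <div class="footer">FOOT</div>'

start = sample.find('<div class="title">')
end = sample.find('<div class="footer">')

new_content = None
if start != -1:
    # Drops everything before the title div...
    new_content = sample[start:]
if end != -1:
    # ...but this slices the ORIGINAL string from index 0 again,
    # so the result of the previous assignment is thrown away.
    new_content = sample[0:end - 1]

# The leading "HEAD" is still present, matching the behavior in the question.
print(new_content)
```

Both conditions are true here, so only the last assignment survives and the text before the title div is kept.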

You could do something like

start = start if (start != -1) else 0
end = end if (end != -1) else len(content)

and then

newContent = content[start:end]  # the slice end is exclusive, so "end - 1" would drop one extra character
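
Putting the pieces together, the whole script could be sketched roughly as follows. The helper names `extract_between` and `process_directory` are hypothetical, introduced only for this illustration. The sketch also makes two assumptions that go beyond the question: it searches for the end marker only after the start position, and it writes each "_mod" file next to its source file rather than into the current working directory. Note that it slices with `content[start:end]` rather than `end - 1`, because a Python slice already excludes its end index.

```python
import os


def extract_between(content, start_marker, end_marker):
    """Return content from start_marker up to (not including) end_marker."""
    start = content.find(start_marker)
    if start == -1:
        start = 0                  # start marker missing: keep from the beginning
    # Assumption: only look for the end marker after the start position.
    end = content.find(end_marker, start)
    if end == -1:
        end = len(content)         # end marker missing: keep through the end
    # One slice applies both bounds at once, so neither overwrites the other.
    return content[start:end]


def process_directory(folder):
    """Write a trimmed "<name>_mod" copy next to every .html file in folder."""
    for name in os.listdir(folder):
        if not name.endswith('.html'):
            continue
        path = os.path.join(folder, name)
        with open(path) as src:
            content = src.read()
        trimmed = extract_between(content,
                                  '<div class="title">',
                                  '<div class="footer">')
        # Assumption: the output goes beside the source file.
        with open(path + '_mod', 'w') as dst:
            dst.write(trimmed)
```

Calling `process_directory("C:/Users/folder")` would then process the directory from the question. Using `with` closes each file even if reading or writing raises an exception.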

Please explain the logical error when parsing .html files with basic Python
