Question

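This is a fairly general question, and I'm not even sure this is the right community for it; if it isn't, just tell me.

I recently had an HTML file from which I extracted about 90 lines of HTML code (out of roughly 8000 lines in total). I did this with a simple Python script and stored the output (the shortened HTML code) in a text file. Now I'm curious, because the file size increased. What could cause the file to get larger after part of it has been extracted?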

File size before: 319.374 bytes; file size after: 321.516 bytes

Is this because of the different file formats, HTML and TXT?

Any help or suggestions are appreciated!

Code:

import glob
import os
import re


def extractor():
    # Non-greedy match from StartTag to EndTag; re.S lets "." span newlines
    extract = re.compile(r'StartTag.*?EndTag', re.S)
    os.chdir(r"F:\Test")  # the directory containing my html
    for file in glob.iglob("*.html"):  # iterates over all files in the directory ending in .html
        with open(file, encoding="utf8") as f, open(file.rsplit(".", 1)[0] + ".txt", "w", encoding="utf8") as out:
            contents = f.read()
            # Write the text with the matched span removed, but only if there was a match
            if extract.search(contents) is not None:
                out.write(extract.sub('', contents))
            # No explicit out.close() needed: the with block closes both files


extractor()

Edit: I also tried using ".html" instead of ".txt" as the output file format. However, the difference remains.

Answer 1

This code does not write to the original HTML file. Something else must be causing the file size to increase.
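The answer does not name the "something else". One candidate worth ruling out, offered here as an assumption and not a confirmed diagnosis, is newline translation: in Python's default text mode on Windows, every "\n" written is turned into "\r\n", so a file that originally used bare LF line endings grows by one byte per line when it is read and written back. The sketch below uses hypothetical helper names (`count_line_endings`, `demo`) and hypothetical file names, and passes `newline="\r\n"` explicitly so the effect can be reproduced on any platform.

```python
import os
import tempfile


def count_line_endings(path):
    """Count CRLF and bare-LF line endings by reading raw bytes."""
    with open(path, "rb") as f:  # binary mode, so nothing is translated
        data = f.read()
    crlf = data.count(b"\r\n")
    bare_lf = data.count(b"\n") - crlf
    return crlf, bare_lf


def demo():
    """Copy an LF-only file through text mode, as the script in the question does."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "in.html")  # hypothetical file names
        dst = os.path.join(tmp, "out.txt")
        with open(src, "wb") as f:
            f.write(b"<p>a</p>\n<p>b</p>\n<p>c</p>\n")  # 3 LF-terminated lines
        # newline="\r\n" mimics what the default text mode does on Windows
        with open(src, encoding="utf8") as f, \
                open(dst, "w", encoding="utf8", newline="\r\n") as out:
            out.write(f.read())
        return os.path.getsize(src), os.path.getsize(dst), count_line_endings(dst)
```

If `count_line_endings` reports mostly bare LF for the original HTML and CRLF for the output, line endings would account for at least part of the size difference. Passing `newline=""` to both `open()` calls in the script disables the translation, so the original line endings are kept.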

File size increases after extraction?
