Question

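I am extracting a certain part of an HTML document (to be fair: the basis is an iXBRL document, which means there is a lot of formatting code inside) and writing my output, the original file without the extracted part, to a .txt file. My goal is to measure the difference in document size (how many KB of the original document belong to the extracted part). As far as I know, there should be no difference between HTML and text format, so my difference should be reliable even though I am comparing two different document formats. My code so far is: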

import glob
import os
import contextlib
import re


@contextlib.contextmanager
def stdout2file(fname):
    import sys
    f = open(fname, 'w')
    sys.stdout = f
    yield
    sys.stdout = sys.__stdout__
    f.close()


def extractor():
    os.chdir(r"F:\Test")
    with stdout2file("FileShortened.txt"):
        for file in glob.iglob('*.html', recursive=True):
            with open(file) as f:
                contents = f.read()
            extract = re.compile(r'(This is the beginning of).*?Until the End', re.I | re.S)
            cut = extract.sub('', contents)
            print(file.split(os.path.sep)[-1], end="| ")
            print(cut, end="\n")


extractor()

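Note: I am not using BS4 or lxml, because I am interested not only in the HTML text but in every line between my start and end RegEx, including all the lines of formatting code.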

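My code works without problems, but since I have a lot of files, my FileShortened.txt document quickly becomes huge. My problem is not with the files or the extraction, but with redirecting the output to separate txt files. Right now everything goes into one file; what I need is some kind of "for every file searched, create a new txt file with the same name as the original document" condition (arcpy module?!).
Something like: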

File1.html -> File1Short.txt

File2.html -> File2Short.txt ...

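Is there also a simple way (without changing my code too much) to invert my code, in the sense that it prints the "RegEx match" to a new .txt file instead of "everything except my RegEx match"?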

Any help is appreciated!

Answer 1

OK, I figured it out. The final code is:

import glob
import os
import re


def extractor():
    os.chdir(r"F:\Test")  # the directory containing my html
    # compile the pattern once, outside the loop
    extract = re.compile(r'Start.*?End', re.I | re.S)
    for file in glob.glob("*.html"):  # iterates over all files in the directory ending in .html
        # File1.html -> File1.txt; the with block closes both files automatically
        with open(file) as f, open(file.rsplit(".", 1)[0] + ".txt", "w") as out:
            contents = f.read()
            cut = extract.sub('', contents)  # everything except the matched section
            out.write(cut)


extractor()
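
The code above writes everything except the match. For the inverted case asked about in the question (writing only the RegEx match, to File1Short.txt and so on), a minimal sketch might look like the following. It reuses the `Start.*?End` pattern from above. The function names `extract_sections` and `write_sections` are hypothetical, and using `pathlib` instead of `os.chdir` is a substitution made here for illustration, not part of the original code.

```python
import re
from pathlib import Path

# Same pattern as in the answer above; swap in the real start/end markers.
SECTION = re.compile(r'Start.*?End', re.I | re.S)


def extract_sections(contents):
    """Return every matched section as a list of strings.

    finditer + group(0) is used instead of findall so that the full match
    is returned even if the pattern contains a capturing group.
    """
    return [m.group(0) for m in SECTION.finditer(contents)]


def write_sections(directory):
    """For every X.html in `directory`, write its matches to XShort.txt.

    Returns the list of paths that were written (hypothetical helper).
    """
    written = []
    for html_path in sorted(Path(directory).glob("*.html")):
        sections = extract_sections(html_path.read_text())
        if not sections:
            continue  # no match: skip rather than create an empty file
        out_path = html_path.with_name(html_path.stem + "Short.txt")
        out_path.write_text("\n".join(sections))
        written.append(out_path)
    return written
```

Calling `write_sections(r"F:\Test")` would then produce File1Short.txt next to File1.html, and so on. The size in bytes of each written file is available through `out_path.stat().st_size`, which may help with the KB comparison described in the question.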

Extract text from plain HTML and write it to a new file
