Question

I am currently using this code:

from bs4 import BeautifulSoup
import glob
import os
import re
import contextlib


@contextlib.contextmanager
def stdout2file(fname):
    import sys
    f = open(fname, 'w')
    sys.stdout = f
    yield
    sys.stdout = sys.__stdout__
    f.close()

def trade_spider():
    os.chdir(r"C:\Users\6930p\FLO'S DATEIEN\Master FAU\Sommersemester 2016\02_Masterarbeit\04_Testumgebung\01_Probedateien für Analyseaspekt\Independent Auditors Report")
    with stdout2file("output.txt"):
        for file in glob.iglob('**/*.html', recursive=True):
            with open(file, encoding="utf8") as f:
                contents = f.read()
                soup = BeautifulSoup(contents, "html.parser")
                for item in soup.findAll("ix:nonfraction"):
                    if re.match(".*AuditFeesExpenses", item['name']):
                        print(file.split(os.path.sep)[-1], end="| ")
                        print(item['name'], end="| ")
                        print(item.get_text())
trade_spider()

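So far this works perfectly, but now I've run into another problem. If I search a folder that contains only files and no subfolders, everything works. But if I run the code on a folder that has subfolders, it doesn't work (it prints nothing!). In addition, I'd like to print my results to a .txt file without the full path. The result should look like this: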

Filename.html| RegEX Match| HTML text

I already get this result, but only in PyCharm, not in a separate .txt file.

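In summary, I have 2 questions:

How do I walk through the subfolders of the directory I defined? -> Would os.walk() be the right choice?
How do I print the results to a .txt file? -> Would sys.stdout work for this?

Any help with this is appreciated!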

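UPDATE: It only prints the first result of the first file to my "output.txt" file (at least I think it's the first one, since it's the last file in my only subfolder and recursive=True is enabled). Any idea why it isn't looping through all the other files?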

UPDATE_2: Problem solved! The final code can be seen above!

Answer 1

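For walking through subdirectories, there are two options:

Use ** with glob and the argument recursive=True (glob.glob('**/*.html', recursive=True)). This only works in Python 3.5+. If the directory tree is large, I'd also suggest using glob.iglob instead of glob.glob.
Use os.walk and check the filenames either manually (do they end with ".html"?) or with fnmatch.filter.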
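The two options above can be sketched side by side as follows. This is only a minimal illustration: the function names find_html_with_glob and find_html_with_walk and the root parameter are made up for this example and are not part of the question's code. Note that glob skips files whose names start with a dot, while os.walk does not.

```python
import fnmatch
import glob
import os


def find_html_with_glob(root):
    """Return all .html files below root via a recursive glob (Python 3.5+)."""
    # glob.escape protects any glob metacharacters in the root path itself.
    pattern = os.path.join(glob.escape(root), "**", "*.html")
    # With recursive=True, "**" matches zero or more nested directories,
    # so files directly inside root are found as well.
    return sorted(glob.iglob(pattern, recursive=True))


def find_html_with_walk(root):
    """Return all .html files below root via os.walk and fnmatch.filter."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # fnmatch.filter keeps only the names matching the pattern.
        for name in fnmatch.filter(filenames, "*.html"):
            matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

Both functions return full paths; os.path.basename(path) gives just the file name if that is all the output needs.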

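As for printing to a file, there are again several ways:

Simply run the script and redirect standard output, i.e. python3 myscript.py >myfile.txt
Open a file object in write mode and replace the calls to print with calls to its .write() method.
Keep using print, but pass it the argument file=myfile, where myfile is again a writable file object.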
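As a rough sketch of the last option, print can write straight to an open file through its file argument. The helper name write_rows and its parameters rows and out_path are hypothetical and chosen only for this example; the "| " separator mirrors the output format asked for in the question.

```python
def write_rows(rows, out_path):
    """Write each (filename, name, text) tuple as 'filename| name| text'."""
    # Assumption: UTF-8 output is wanted; adjust the encoding if needed.
    with open(out_path, "w", encoding="utf8") as out:
        for filename, name, text in rows:
            # file=out sends the line to the file instead of the console.
            print(filename, name, text, sep="| ", file=out)
```

Because the file is passed explicitly, sys.stdout is left untouched and anything else the script prints still shows up in the console.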

Edit: Perhaps the least obtrusive approach is the following. First, include this somewhere:

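import contextlib
import sys


@contextlib.contextmanager
def stdout2file(fname):
    """Temporarily send everything printed to stdout into the file fname."""
    with open(fname, 'w') as f:
        sys.stdout = f
        try:
            yield
        finally:
            # Restore the real stdout even if the body raises an exception.
            sys.stdout = sys.__stdout__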

Then, in front of the line where you loop over the files, add this line (and indent the loop accordingly):

with stdout2file("output.txt"):
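As an aside, the standard library has offered contextlib.redirect_stdout since Python 3.4, which does the same kind of temporary redirection without a hand-written context manager. The sketch below is an alternative to the helper above, not the approach this answer uses; the wrapper name run_with_stdout_in_file and its parameters are hypothetical.

```python
import contextlib


def run_with_stdout_in_file(func, out_path):
    """Call func() while everything it prints goes to the file out_path."""
    # Assumption: UTF-8 output is wanted; adjust the encoding if needed.
    with open(out_path, "w", encoding="utf8") as out:
        # redirect_stdout restores the previous sys.stdout on exit,
        # including when func raises an exception.
        with contextlib.redirect_stdout(out):
            func()
```

With the question's code, this would be called as run_with_stdout_in_file(trade_spider, "output.txt") after removing the with stdout2file(...) line from trade_spider.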
