Question

I use the following function to extract all the text found between the <html> and </html> tags in a .txt file:

import os

def html_part(filepath):
    """
    Generator returning only the HTML lines from an
    SEC Edgar SGML multi-part file.
    """
    start, stop = '<html>\n', '</html>\n'
    filepath = os.path.expanduser(filepath)
    with open(filepath) as f:
        # find start indicator, yield it
        for line in f:
            if line == start:
                yield line
                break
        # yield lines until stop indicator found, yield and stop
        for line in f:
            yield line
            if line == stop:
                # return ends the generator; raising StopIteration here
                # is a RuntimeError on Python 3.7+
                return

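The problem with this function is that it only grabs the first section between <html> and </html>. The .txt file, however, contains further sections wrapped in <html> and </html> tags. How can I adapt the function above to capture all the text found between <html> and </html> tags? A sample .txt file can be found here.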

This is how I call the function:

origpath = 'C:\\samplefile.txt'
htmlpath = origpath.replace('.txt', '.html')
with open(htmlpath, "w") as out:
    out.write(''.join(html_part(origpath)))

Answer 1

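You need to set this up so that the same start/stop check can be applied repeatedly as you move through the file. Also, is it necessary for start and stop to include \n? What happens if <html> is followed directly by more markup with no newline in between? HTML does not depend on line breaks, so the whole document could be written on a single line.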

So I would first change your start and stop variables so that they do not include \n.

start, stop = "<html>", "</html>"

Next, adjust the loop so that it keeps looking for further sections instead of stopping after the first one:

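with open(filepath) as f:
    # switch is 0 while outside an html section, 1 while inside one
    switch = 0
    for line in f:
        if switch == 0:
            # find start indicator, yield it
            if start in line:
                yield line
                # stay outside if the section also closes on this line
                switch = 0 if stop in line else 1
        elif switch == 1:
            # yield lines until the stop indicator, then look for the next start
            yield line
            if stop in line:
                switch = 0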
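Putting the two changes together, a minimal self-contained sketch of the whole generator might look like this. The name html_parts and the boolean inside are my own choices, and it assumes the tags appear in lowercase exactly as in the question. No explicit StopIteration is needed, because a generator simply ends when the loop runs out of lines.

```python
import os


def html_parts(filepath):
    """Yield every line that lies inside any <html>...</html> section."""
    start, stop = "<html>", "</html>"
    inside = False  # True while we are between a start and a stop tag
    with open(os.path.expanduser(filepath)) as f:
        for line in f:
            if not inside and start in line:
                inside = True
            if inside:
                yield line
                # Checked after yielding, so a one-line <html>...</html>
                # section is emitted once and then closed again.
                if stop in line:
                    inside = False
```

It can be used exactly like the original function, for example ''.join(html_parts(origpath)).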

Answer 2

You can use a regular expression:

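import re

with open("filepath.txt", "r") as f:
    content = f.read()
# re.DOTALL lets "." match newlines, so multi-line sections are found
htmlPart = re.findall("<html>.*?</html>", content, flags=re.DOTALL)
# strip the leading "<html>" (6 chars) and trailing "</html>" (7 chars)
htmlPart = [i[6:-7] for i in htmlPart]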
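One thing to watch with the regex approach: by default "." does not match a newline, so a section that spans several lines is only found when re.DOTALL is set. A small illustration on a made-up string (the sample text is invented for this example, not taken from the actual file):

```python
import re

sample = "header\n<html>\n<p>one</p>\n</html>\nfiller\n<html>\n<p>two</p>\n</html>\n"

# Without DOTALL, "." stops at each newline, so multi-line sections are missed.
no_flag = re.findall(r"<html>.*?</html>", sample)

# With DOTALL, "." also matches "\n"; the capture group keeps only the inner text.
sections = re.findall(r"<html>(.*?)</html>", sample, flags=re.DOTALL)

print(no_flag)   # []
print(sections)  # ['\n<p>one</p>\n', '\n<p>two</p>\n']
```

Using a capture group also avoids slicing the tags off by hand afterwards.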

Answer 3

This should do the job and collect all the html sections into a single .html file:

start, stop = '<html>\n', '</html>\n'  # same indicators as in the question
writing = False
with open(origpath) as f, open('my_file.html', 'a') as html_file:
    for line in f:
        # find start indicator
        if line == start:
            writing = True
        if writing:
            # line already ends with '\n', so write it unchanged
            html_file.write(line)
        # stop writing once the stop indicator has been written
        if line == stop:
            writing = False

Answer 4

Using a regular expression like this is simpler and better:

import re
with open(filepath) as f:
    content = f.read()  # findall needs the file's text, not its path
result = re.findall(r"(?si)<(html)[^>]*>(.*?)</\1>", content)
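Note that this pattern has two capture groups, so re.findall returns a list of (tag, body) tuples rather than plain strings. A short sketch on an invented sample string, showing how the bodies could be pulled out (sample and bodies are illustrative names only):

```python
import re

# Invented sample; a real SEC filing will be much larger.
sample = '<HTML lang="en">\nfirst\n</HTML>\ntext between\n<html>\nsecond\n</html>\n'

# (?s) lets "." match newlines, (?i) ignores case, \1 repeats the tag name.
pattern = re.compile(r"(?si)<(html)[^>]*>(.*?)</\1>")

# Two capture groups, so each match is a (tag, body) tuple.
matches = pattern.findall(sample)
bodies = [body for _tag, body in matches]
print(bodies)  # ['\nfirst\n', '\nsecond\n']
```

Because of the (?i) flag and the [^>]* part, uppercase tags and tags carrying attributes are matched as well.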

Python: grab all the sections between tags in a txt file
