Question

I have a text file whose contents are divided into blocks in the following format:

start1
loads of text
end1
start2
loads of text
end2

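What I need to do is find the start of a block and then parse the text inside the block until it ends. My understanding (which may be wrong) is that I need two for loops: the first to find the start of the block, the second to look for information inside the block. What I can't figure out is how to start the second loop at the line where the first loop left off. Whatever I do, it always starts from the beginning of the file. Here is a snippet of what I have.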

for line in s:
    if "start1" in line:
        print("started")
        ...second for loop...
    elif "end1" in line:
        print("finished")

Answer 1

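It's easy... you can keep using the same iterator. The bigger problem is that your start and end delimiters are not unique. I don't know whether that is just a simplified example or whether there is more to it. The thing about delimiters is that they need to be predictable, and they must not appear inside the text being delimited.

Assuming you don't care about the delimiter part yet... this will go through the file. Note that you need a common iterator for this to work: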

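iter_s = iter(s)  # one shared iterator for both loops
for line in iter_s:
    if "start1" in line:
        print("started")
        for line in iter_s:  # continues from the line after "start1"
            if "end1" in line:
                print("finished")
                break  # hand control back to the outer loop
            else:
                print("got a line")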

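Update

My original code worked for files but not for lists. I changed it to grab an iterator before entering the for loop. There was a question about why iter_s = iter(s) is needed to make this work. In fact, not every object needs it. Suppose s is a file object. File objects act as their own iterators, so however many times you call iter() on one, you get the same file object back, and each use of it grabs the next line.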

>>> f=open('deleteme.txt', 'w')
>>> iter_f = iter(f)
>>> id(iter_f) == id(f)
True
>>> type(f)
<class '_io.TextIOWrapper'>
>>> type(iter_f)
<class '_io.TextIOWrapper'>
>>> f.close()

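Other sequences define their own iterators, which work independently of each other. So for a list, every new iterator starts from the top. In this case each iterator is like a separate cursor into the list.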

>>> l=[]
>>> iter_l = iter(l)
>>> id(iter_l) == id(l)
False
>>> type(l)
<class 'list'>
>>> type(iter_l)
<class 'list_iterator'>

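When a for loop starts, it gets an iterator for its object and then runs through it. If the object is already an iterator, the loop simply uses it. That is why you grab the iterator first.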
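A small sketch of that difference, using a made-up list of lines (the sample data is an assumption, not taken from the question). Looping over the list twice restarts from the top, while looping twice over one shared iterator resumes where the first loop stopped.

```python
lines = ["a", "start1", "b", "end1", "c"]  # made-up sample data

# Looping over the list itself: the inner loop gets a fresh iterator
# and starts from the top again.
seen_from_list = []
for line in lines:
    if line == "start1":
        for inner in lines:
            seen_from_list.append(inner)
        break

# Looping over one shared iterator: the inner loop continues after "start1".
seen_from_iter = []
iter_lines = iter(lines)
for line in iter_lines:
    if line == "start1":
        for inner in iter_lines:
            seen_from_iter.append(inner)
        break

print(seen_from_list)  # ['a', 'start1', 'b', 'end1', 'c']
print(seen_from_iter)  # ['b', 'end1', 'c']
```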

To make sure this works with both kinds of sequences, grab the iterator.
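As a rough illustration, the same idea can be wrapped in a helper that accepts either a list or a file-like object. The name collect_block and the sample data are hypothetical, and the markers are assumed to be plain substrings as in the question.

```python
def collect_block(source, start="start1", end="end1"):
    """Return the lines between the start and end markers, or None if no start is found."""
    it = iter(source)  # works for lists and for file objects alike
    for line in it:
        if start in line:
            block = []
            for line in it:  # resumes right after the start line
                if end in line:
                    return block
                block.append(line)
            return block  # input ended before the end marker was seen
    return None


# Made-up sample data mirroring the format in the question.
sample = ["start1\n", "loads of text\n", "end1\n",
          "start2\n", "loads of text\n", "end2\n"]
print(collect_block(sample))  # ['loads of text\n']
```

Because the helper calls iter() itself, the caller does not need to know whether the source is its own iterator.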

Answer 2

You want to use a while loop for this:

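line = file.readline()  # file is assumed to be an already opened file object
while line != '':
    if "start1" in line:
        print("started")
        line = file.readline()
        while "end1" not in line and line != '':
            print("Read a line.")
            line = file.readline()
        print("Finished")
    line = file.readline()  # always advance, otherwise the outer loop never ends

This should give the expected result.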

Answer 3

Does this help?

filename = "file to open"
with open(filename) as f:
    for line in f:
        line = line.strip()  # drop the trailing newline before comparing
        if line.startswith("start"):
            print("started")
        elif line.startswith("end"):
            print("finished")
        else:
            print("this is just an ordinary text")
            # Do whatever here
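If ordinary lines should be handled differently inside and outside a block, a single loop with a flag is one option. This is only a sketch: the function name lines_in_block and the sample data are made up, and each marker is assumed to sit on a line of its own.

```python
def lines_in_block(lines, start="start1", end="end1"):
    """Collect the stripped lines that sit between the start and end markers."""
    inside = False  # True while we are between the two markers
    found = []
    for line in lines:
        text = line.strip()
        if text == start:
            inside = True
        elif text == end:
            inside = False
        elif inside:
            found.append(text)
    return found


sample = ["header\n", "start1\n", "loads of text\n", "end1\n", "start2\n", "other\n", "end2\n"]
print(lines_in_block(sample))  # ['loads of text']
```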

Answer 4

Depending on what you are doing with the data, something like this may be useful.

def readit(filepath):
    with open(filepath) as thefile:
        data = []
        sentinel= 'end1'
        for line in thefile:
            if line.startswith('start'):
                sentinel= 'end' + line.rstrip()[-1] #the last char (without the newline)
            elif line.rstrip() == sentinel:  # again, the rstrip is to drop the newline char
                yield data
                data = []
            else:
                data.append(line)

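This is a generator that yields all the data between the 'start' and 'end' values each time it is advanced.

You can use it like this:

>>> generator = readit('data.txt')  # placeholder name for the data file shown below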
>>> next(generator)
['loads of text\n']
>>> next(generator)
['more text\n']

Here is what my data file looks like:

start1
loads of text
end1
start2
more text
end2
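Since readit is a generator, it can also be consumed in one go instead of through repeated next() calls. The sketch below repeats the function so that it runs on its own and writes the sample data to a temporary file; the temporary file is only scaffolding for the example, not part of the answer.

```python
import os
import tempfile


def readit(filepath):
    with open(filepath) as thefile:
        data = []
        sentinel = 'end1'
        for line in thefile:
            if line.startswith('start'):
                # the last character of the start line (without the newline)
                sentinel = 'end' + line.rstrip()[-1]
            elif line.rstrip() == sentinel:
                yield data
                data = []
            else:
                data.append(line)


sample = "start1\nloads of text\nend1\nstart2\nmore text\nend2\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)

try:
    blocks = list(readit(tmp.name))  # exhausts the generator, one list per block
finally:
    os.remove(tmp.name)

print(blocks)  # [['loads of text\n'], ['more text\n']]
```

Note that the sentinel is built from the last character of the start line only, so a marker such as start10 would be paired with end0 rather than end10.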

Answer 5

Edit: that wasn't what the OP was looking for. This is the right solution:

# One of the most versatile built-in Python libraries for string manipulation.
import re

# Defined before the loop so that it exists when the loop calls it.
def myparser(string):
    ...

text = "your text here"
lines = text.splitlines()

# Index of the current block's start line, or None while outside a block.
start = None

# enumerate() allows you to get both indexes and lines
for i, line in enumerate(lines):

    if start is None and re.search(r"start[1-9][0-9]*", line):
        start = i

    elif start is not None and re.search(r"end[1-9][0-9]*", line):
        myparser("\n".join(lines[start+1:i]))
        start = None

You can find more information about re here.

Answer 6

I saw in your comments that you are going to use RegEx to parse the blocks... so why not use RegEx to split the text into blocks as well:

from __future__ import absolute_import

import re


def parse_blocks(txt, blk_begin_re=r'start[\d]*', blk_end_re=r'end[\d]*', re_flags=re.I | re.S):
    """
    parse text 'txt' into blocks, beginning with 'blk_begin_re' RegEx
        and ending with 'blk_end_re' RegEx

    returns a list of tuples (parsed_block_begin, parsed_block, parsed_block_end)

    re.S (DOTALL) is needed so that '.' also matches the newlines inside a block
    """
    pattern = r'({0})(.*?)({1})'.format(blk_begin_re, blk_end_re)
    return re.findall(pattern, txt, re_flags)

# read file into 'data' variable
with open('text.txt', 'r') as f:
    data = f.read()

# list all parsed blocks
for blk_begin, blk, blk_end in parse_blocks(data, r'start[\d]*', r'end[\d]*', re.I | re.S):
    # print line separator
    print('=' * 60)
    print('started block: [{}]'.format(blk_begin))
    print(blk)
    print('ended block: [{}]'.format(blk_end))
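A possible variation, not part of the answer above: a backreference can require the number after "end" to match the number after "start". The sketch assumes each marker sits on its own line; BLOCK_RE and the sample string are made up for illustration.

```python
import re

# (\d+) captures the block number; \1 requires the same number after "end".
# re.M lets ^ and $ match at line boundaries, re.S lets '.' match newlines.
BLOCK_RE = re.compile(r'^start(\d+)\n(.*?)^end\1$', re.M | re.S)

sample = "start1\nloads of text\nend1\nstart2\nmore text\nend2\n"

# findall returns one (number, body) tuple per matched block.
blocks = BLOCK_RE.findall(sample)
print(blocks)  # [('1', 'loads of text\n'), ('2', 'more text\n')]
```

Anchoring the end marker with $ keeps end1 from matching the beginning of a line such as end10.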

Python: continue reading a file
