Question

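Using Python, I'd like to read into a dictionary all of the lines in a text file that come after a particular string. I'd like to do this across thousands of text files.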

I'm able to identify and print out the particular string ('Abstract') using the following code (taken from this Stack Overflow answer):

for files in filepath:
    with open(files, 'r') as f:
        for line in f:
            if 'Abstract' in line:
                print(line)

But how do I tell Python to start reading only the lines that come after the string?

Answer 1

Just start another loop when you reach the line you want to start from:

for files in filepath:
    with open(files, 'r') as f:
        for line in f:
            if 'Abstract' in line:
                for line in f:  # now you are at the lines you want
                    pass  # do work

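A file object is its own iterator, so when we reach the line containing 'Abstract', the inner loop continues iterating from the line after it until the iterator is consumed.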

A simple example:

gen = (n for n in range(8))

for x in gen:
    if x == 3:
        print("starting second loop")
        for x in gen:
            print("In second loop",x)
    else:
        print("In first loop", x)

In first loop 0
In first loop 1
In first loop 2
starting second loop
In second loop 4
In second loop 5
In second loop 6
In second loop 7

You can also use itertools.dropwhile to consume the lines up to the point you want.

from itertools import dropwhile

for files in filepath:
    with open(files, 'r') as f:
        # skip every line until one contains "Abstract"
        dropped = dropwhile(lambda _line: "Abstract" not in _line, f)
        next(dropped, "")  # discard the "Abstract" line itself
        for line in dropped:
            print(line)
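
As a quick way to check the dropwhile approach without touching real files, here is a minimal sketch that runs it against an in-memory file. The helper name lines_after and the sample text are made up for illustration; io.StringIO simply stands in for an open file.

```python
import io
from itertools import dropwhile


def lines_after(handle, marker):
    """Return the lines that follow the first line containing marker."""
    dropped = dropwhile(lambda line: marker not in line, handle)
    next(dropped, None)  # discard the marker line itself, if there is one
    return [line.rstrip("\n") for line in dropped]


# io.StringIO behaves like an open text file for iteration purposes.
sample = io.StringIO("Title\nAbstract\nfirst\nsecond\n")
after = lines_after(sample, "Abstract")
print(after)  # ['first', 'second']

# If the marker never appears, everything is dropped and the result is empty.
missing = lines_after(io.StringIO("no marker here\n"), "Abstract")
print(missing)  # []
```

The default passed to next() matters: without it, a file that never contains the marker would raise StopIteration.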

Answer 2

Use a boolean to ignore the lines up to that point:

for files in filepath:
    found_abstract = False  # reset the flag for every file
    with open(files, 'r') as f:
        for line in f:
            if 'Abstract' in line:
                found_abstract = True
            if found_abstract:
                # do whatever you want; note this also runs for the 'Abstract' line itself
                print(line)

Answer 3

You can use itertools.dropwhile and itertools.islice here. A pseudo-example:

from itertools import dropwhile, islice

for fname in filepaths:
    with open(fname) as fin:
        start_at = dropwhile(lambda L: 'Abstract' not in L.split(), fin)
        for line in islice(start_at, 1, None):  # skip the first remaining line, which is the one containing Abstract
            print(line)
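
One detail worth checking against your own files: the lambda above tests membership in L.split(), so it only matches when 'Abstract' appears as a whole whitespace-separated word, whereas the other answers use a plain substring test. A small sketch of the difference (the sample lines and function names are invented for illustration):

```python
def token_match(line):
    # True only if 'Abstract' is a whole whitespace-separated word.
    return "Abstract" in line.split()


def substring_match(line):
    # True if 'Abstract' appears anywhere in the line.
    return "Abstract" in line


samples = ["Abstract\n", "Abstract: we study\n", "Abstractions are useful\n"]
token_results = [token_match(s) for s in samples]
substring_results = [substring_match(s) for s in samples]

print(token_results)      # [True, False, False]
print(substring_results)  # [True, True, True]
```

So if the heading in your files looks like "Abstract:" the split-based test will skip past it, and the substring test may be the better fit.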

Answer 4

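Just to clarify, your code already "reads" all the lines. To start "paying attention" to lines after a certain point, you can set a boolean flag that indicates whether lines should be ignored, and check it on each line.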

pay_attention = False
for line in f:
    if pay_attention:
        print(line)
    else:  # We haven't found our trigger yet; see if it's in this line
        if 'Abstract' in line:
            pay_attention = True

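If you don't mind rearranging your code a little more, you can also use two partial loops instead: one loop that terminates once you've found your trigger phrase ('Abstract'), and one that reads all of the following lines. This approach is a little cleaner (and marginally faster).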

for skippable_line in f:  # First skim over all lines until we find 'Abstract'.
    if 'Abstract' in skippable_line:
        break
for line in f:  # The file's iterator starts up again right where we left it.
    print(line)

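The reason this works is that the file object returned by open behaves like a generator rather than, say, a list: it only produces values as they are requested. So when the first loop stops, the file is left with its internal position at the beginning of the first "unread" line. This means that when you enter the second loop, the first line you see is the first line after the one that triggered the break.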
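
A small self-contained check of that behavior, using io.StringIO as a stand-in for a real file (the sample text is made up): the second loop picks up at the line after the marker, not at the marker itself.

```python
import io

# io.StringIO is iterated line by line, just like a file opened in text mode.
f = io.StringIO("intro\nAbstract\nline one\nline two\n")

for skippable_line in f:  # consume lines up to and including the marker
    if "Abstract" in skippable_line:
        break

# The iterator resumes where the first loop stopped.
remaining = [line.rstrip("\n") for line in f]
print(remaining)  # ['line one', 'line two']
```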

Answer 5

To me, the following code is easier to understand.

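with open(file_name, 'r') as f:
    # Skip lines up to and including the first one containing 'Abstract'.
    # Note: next(f) raises StopIteration if no line contains 'Abstract'.
    while 'Abstract' not in next(f):
        pass
    for line in f:
        # line is now a line after the one that contains 'Abstract'
        print(line)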

Answer 6

Guessing at how the dictionary is involved, I'd write it this way:

lines = dict()
for filename in filepath:
   with open(filename, 'r') as f:
       for line in f:
           if 'Abstract' in line:
               break
       lines[filename] = tuple(f)

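So for each file, your dictionary contains a tuple of lines.

This works because the loop reads up to and including the line you identify, leaving the remaining lines in the file ready to be read from f.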
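
Here is a runnable sketch of the same idea wrapped in a function, exercised on two temporary files. The function name, the file names, and the choice to strip trailing newlines are my own assumptions, not part of the question. It also shows what happens when a file has no 'Abstract' line: the first loop exhausts the file, so that file maps to an empty tuple.

```python
import os
import tempfile


def lines_after_marker(paths, marker="Abstract"):
    """Map each path to a tuple of the lines that follow the marker line."""
    result = {}
    for path in paths:
        with open(path, "r") as f:
            for line in f:  # read up to and including the marker line
                if marker in line:
                    break
            # Whatever is left in f comes after the marker (nothing if not found).
            result[path] = tuple(line.rstrip("\n") for line in f)
    return result


with tempfile.TemporaryDirectory() as tmp:
    with_marker = os.path.join(tmp, "a.txt")
    without_marker = os.path.join(tmp, "b.txt")
    with open(with_marker, "w") as out:
        out.write("Title\nAbstract\nfirst\nsecond\n")
    with open(without_marker, "w") as out:
        out.write("Title\nno marker\n")

    lines = lines_after_marker([with_marker, without_marker])
    after_a = lines[with_marker]
    after_b = lines[without_marker]

print(after_a)  # ('first', 'second')
print(after_b)  # ()
```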
