Question

Suppose I have a text file with the following content:

fdsjhgjhg
fdshkjhk
 Start
     Good Morning
     Hello World
 End
dashjkhjk
dsfjkhk
Start
  hgjkkl
  dfghjjk
  fghjjj
Start
   Good Evening
   Good 
End

I wrote the following code:

infile = open('test.txt','r')
outfile= open('testt.txt','w')
copy = False
for line in infile:
    if line.strip() == "Start":
        copy = True
    elif line.strip() == "End":
        copy = False
    elif copy:
        outfile.write(line)

I get this result in outfile:

     Good Morning
     Hello World
     hgjkkl
     dfghjjk
     fghjjj
     Good Evening
     Good

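My problem is that I only want the data between a Start and an End, not the data between a Start and another Start, or between an End and another End.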

Answer 1

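Great question! This is a bucket problem: every Start needs a matching End.

You get that result because there are two consecutive 'Start' lines.

It is best to store the lines somewhere until an 'End' is triggered.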

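with open('test.txt', 'r') as infile, open('testt.txt', 'w') as outfile:
    bucket = []   # lines collected since the most recent Start
    copy = False
    for line in infile:

        if line.strip() == "Start":
            # a repeated Start throws away whatever was collected so far
            bucket = []
            copy = True

        elif line.strip() == "End":
            # the section is complete, so it is safe to write it out
            for strings in bucket:
                outfile.write(strings + '\n')
            bucket = []   # a second End in a row writes nothing
            copy = False

        elif copy:
            bucket.append(line.strip())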
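
The bucket idea can also be sketched as a small function that works on any iterable of lines, which makes it easy to try without touching files. The name extract_blocks and the shortened in-memory sample are assumptions made for illustration, not part of the answer above.

```python
def extract_blocks(lines, start="Start", end="End"):
    """Return a list of blocks; each block holds the stripped lines
    found between a Start marker and the next End marker."""
    blocks = []
    bucket = None  # None means we are not inside a Start ... End section
    for line in lines:
        stripped = line.strip()
        if stripped == start:
            bucket = []  # a repeated Start discards what was collected so far
        elif stripped == end:
            if bucket is not None:
                blocks.append(bucket)
            bucket = None  # a stray End with no open Start is ignored
        elif bucket is not None:
            bucket.append(stripped)
    return blocks


# hypothetical sample, shortened from the file in the question
sample = """fdsjhgjhg
 Start
     Good Morning
     Hello World
 End
dashjkhjk
Start
  hgjkkl
Start
   Good Evening
   Good
End
""".splitlines()

print(extract_blocks(sample))
# [['Good Morning', 'Hello World'], ['Good Evening', 'Good']]
```

Using None for "no open section" replaces the separate copy flag, so a stray End before any Start cannot touch an undefined bucket.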

Answer 2

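You can keep a temporary list of lines and only commit them once you know the section meets your criteria. Perhaps try something like the following: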

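with open('test.txt', 'r') as infile, open('testt.txt', 'w') as outfile:
    copy = False
    tmpLines = []
    for line in infile:
        if line.strip() == "Start":
            # a new Start resets the pending lines
            copy = True
            tmpLines = []
        elif line.strip() == "End":
            copy = False
            # commit the pending lines unchanged, indentation included
            for tmpLine in tmpLines:
                outfile.write(tmpLine)
            tmpLines = []   # avoid writing the same lines again on a repeated End
        elif copy:
            tmpLines.append(line)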

This gives the output:

     Good Morning
     Hello World
   Good Evening
   Good

Answer 3

Here is a hacky but perhaps more intuitive way using regular expressions. It finds all the text that sits between "Start" and "End" pairs, and the print call trims the markers off.

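import re

with open('test.txt', 'r') as infile:
    text = infile.read()

# re.DOTALL lets '.' match newlines, so one match can span several lines
matches = re.findall(r'Start.*?End', text, flags=re.DOTALL)
for m in matches:
    # slice the markers off, then trim the surrounding whitespace
    print(m[len('Start'):-len('End')].strip())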
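
One caveat is worth checking before relying on a lazy pattern like this: the match begins at the first Start it sees, so a Start that is never closed gets pulled into the next match. A minimal sketch, where the in-memory text is a made-up sample for illustration:

```python
import re

# hypothetical sample: the first Start has no End of its own
text = "Start\nstray line\nStart\nGood Evening\nEnd\n"

# the lazy '.*?' stops at the first End, but still starts at the first Start
matches = re.findall(r"Start(.*?)End", text, flags=re.DOTALL)
print(matches)
# ['\nstray line\nStart\nGood Evening\n']
```

So for the file in the question, this pattern alone would still return the lines between the two consecutive Start markers.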

Answer 4

You can do this with a regular expression. This will exclude rogue Start and End lines. For reference, see RegEx.info.

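import re

with open('test.txt', 'r') as f:
    txt = f.read()
# re.M makes ^ and $ match at every line boundary; group 1 is the block body
matches = re.findall(r'^\s*Start\s*$\n((?:^\s*(?!Start).*$\n)*?)^\s*End\s*$', txt, flags=re.M)
for block in matches:
    print(block, end='')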
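
To see what the pattern returns without creating a file, it can be run against an in-memory string. This is only a sketch: the txt sample below is a shortened, made-up version of the file in the question.

```python
import re

# hypothetical sample with one clean block and one block preceded by a stray Start
txt = (
    "junk\n"
    " Start\n"
    "     Good Morning\n"
    " End\n"
    "junk\n"
    "Start\n"
    "  hgjkkl\n"
    "Start\n"
    "   Good Evening\n"
    "End\n"
)

pattern = r'^\s*Start\s*$\n((?:^\s*(?!Start).*$\n)*?)^\s*End\s*$'
matches = re.findall(pattern, txt, flags=re.M)
print(matches)
# ['     Good Morning\n', '   Good Evening\n']
```

Note that because \s* can backtrack to match nothing, the (?!Start) check only rejects a stray Start that begins at column 0; an indented stray Start would be kept as an ordinary content line.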

Answer 5

If you do not expect nested structures, you can do this:

import re

with open('test.txt') as f:
    text = f.read()

# match everything between "Start" and "End"
occurences = re.findall(r"Start(.*?)End", text, re.DOTALL)
# discard the text before a repeated "Start" inside a match
occurences = [oc.rsplit("Start", 1)[-1] for oc in occurences]
# optionally trim the surrounding newlines
occurences = [oc.strip("\n") for oc in occurences]

This prints:

>>> for oc in occurences: print(oc)
     Good Morning
     Hello World
   Good Evening
   Good

If you like, you can include \n as part of the Start and End markers.
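
One possible reading of that last suggestion is to fold the newlines into the pattern itself, so no trimming step is needed afterwards. This sketch assumes the markers sit at the start of their lines with no indentation; the sample text is made up for illustration.

```python
import re

# hypothetical sample with unindented markers and one repeated Start
text = "junk\nStart\nskipped\nStart\n   Good Evening\n   Good\nEnd\njunk\n"

# the newlines next to the markers are consumed by the pattern itself
occurrences = re.findall(r"Start\n(.*?)\nEnd", text, re.DOTALL)
# keep only the text after the last repeated "Start\n"
occurrences = [oc.rsplit("Start\n", 1)[-1] for oc in occurrences]
print(occurrences)
# ['   Good Evening\n   Good']
```

With indented markers, as in the first block of the question's file, the literal "\nEnd" would not match " End", so the pattern would need to allow optional whitespace there.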

Extract values between two strings in a text file
