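Python - Extract strings from a text file up to the first two consecutive newlines

Time: 2016-01-26 14:23:58

Tags: python python-3.x

I have an input file and need to extract groups of lines from it, where the groups are separated by a blank line (two consecutive newlines).

For example, the text file looks like this: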

1. Sometext
Sometext 
Sometext

2. Sometext
Sometext
Sometext

3. Sometext
Sometext
Sometext

Sometext which is not needed
Sometext which is not needed
Sometext which is not needed

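I need to extract one substring that runs from "1." up to (but not including) "2.", a second substring from "2." up to "3.", and so on, based on the numbers. The script below produces output, but it also picks up all of the "not needed" text that I don't want. Here is the code: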

with open("filename", "r") as f:
    content = f.read()

size = len(content)
start = 0
a = 1  # number that starts the current block
b = 2  # number that starts the next block
end = 0

while start < size:
    if end != -1:
        # Slice from "a." up to the newline that precedes "b."
        subString = content[content.find(str(a) + "."):content.find("\n" + str(b) + ".")]
        print(subString)
        end = content.find(str(b) + ".", start)
        print("\n")
        a += 1  # increment to find the next start number
        b += 1  # increment to find the next end number
        start = end + 1  # continue searching after this match
    else:
        break

So I decided to use two consecutive newlines (a blank line) as the end position, with the line below, but it did not work.

subString = content[content.find(str(a) + ".") + 3:content.find("\n\n")]

Please let me know if you have any questions. Thanks in advance.
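A likely reason the content.find("\n\n") attempt fails: without a start offset, find always returns the position of the first blank line in the file, so for every block after the first the end index lies before the start index and the slice comes back empty. Below is a minimal sketch of one way around this. It assumes each block starts with its number followed by ". " and ends at a blank line, and that the marker text does not also occur earlier inside another line. The names block_after and SAMPLE are made up for illustration.

```python
SAMPLE = (
    "1. Sometext\nSometext\nSometext\n\n"
    "2. Sometext\nSometext\nSometext\n\n"
    "3. Sometext\nSometext\nSometext\n\n"
    "Sometext which is not needed\n"
    "Sometext which is not needed\n"
)


def block_after(content, number):
    """Return the text of block `number` without its "N. " prefix, or None."""
    marker = f"{number}. "
    begin = content.find(marker)
    if begin == -1:
        return None  # no such block
    begin += len(marker)
    # Look for the blank line after the marker, not from index 0.
    end = content.find("\n\n", begin)
    if end == -1:
        end = len(content)  # last block: run to the end of the text
    return content[begin:end]


number = 1
while True:
    block = block_after(SAMPLE, number)
    if block is None:
        break
    print(repr(block))
    number += 1
```

Because the loop stops at the first number that cannot be found, the trailing "not needed" text is never reached.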

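2 Answers:

Answer 0 (score: 0)

I'm not sure I understood your question correctly, but the code at the end of this answer produces the following output: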

['Sometext', 'Sometext', 'Sometext']
['Sometext', 'Sometext', 'Sometext']
['Sometext', 'Sometext', 'Sometext']

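for the text in your question. If you instead want each whole block, from "1." up to "2." and so on, as a single substring like this: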

['1. Sometext\nSometext\nSometext']
['2. Sometext\nSometext\nSometext']
['3. Sometext\nSometext\nSometext']

you should change the if statement to:

if is_number(i[0]):
    substring = []
    substring.append(i)
    print(substring)

Otherwise, you can use the code below:

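def is_number(string):
    """Return True if `string` can be parsed as a number."""
    try:
        float(string)
        return True
    except ValueError:
        return False

with open('testing.txt', 'r') as f:
    # Blocks are separated by a blank line, i.e. two consecutive newlines.
    content = f.read().split('\n\n')

for i in content:
    # Keep only blocks whose first character is a digit; skip empty blocks.
    if i and is_number(i[0]):
        c = i.split('\n')
        # Drop the leading "N. " (3 characters) from numbered lines.
        substring = [line[3:] if line and is_number(line[0]) else line for line in c]
        print(substring)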
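Note that the is_number(i[0]) test only inspects the first character, and line[3:] assumes a one-digit number followed by ". ". As a hedged alternative, here is a sketch that matches the whole "N." prefix with a regular expression, so blocks numbered 10 and up are handled too. The names numbered_blocks, strip_number and SAMPLE are illustrative, and the input is assumed to fit in memory.

```python
import re

SAMPLE = (
    "1. Sometext\nSometext\n\n"
    "10. Sometext\nSometext\n\n"
    "Sometext which is not needed\n"
)

# One or more digits, a dot, then optional whitespace, at the start of the string.
NUMBER_PREFIX = re.compile(r"^\d+\.\s*")


def numbered_blocks(text):
    """Return the blank-line-separated blocks that start with "<digits>."."""
    result = []
    for block in re.split(r"\n\s*\n", text):
        block = block.strip()
        if NUMBER_PREFIX.match(block):
            result.append(block)
    return result


def strip_number(block):
    """Remove the leading "<digits>. " from a block."""
    return NUMBER_PREFIX.sub("", block, count=1)


for block in numbered_blocks(SAMPLE):
    print(strip_number(block).split("\n"))
```

Splitting on the pattern "\n\s*\n" also tolerates separator lines that contain only spaces, which a plain split('\n\n') would miss.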

Answer 1 (score: 0)

You will still have to filter out the unwanted lines at the end, but this gets you most of the way there:

from itertools import groupby
with open("in.txt") as f:
    grps = groupby(f, key=lambda x: bool(x.strip()))
    print([list(v) for k,v in grps if k])

Output:

[['1. Sometext\n', 'Sometext\n', 'Sometext\n'], ['2. Sometext\n', 'Sometext\n', 'Sometext\n'], ['3. Sometext\n', 'Sometext\n', 'Sometext\n'], ['Sometext which is not needed\n', 'Sometext which is not needed\n', 'Sometext which is not needed']]
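To see why this works: groupby collects consecutive lines that share the same key, and the key here is True for lines with visible text and False for blank ones. The small sketch below runs on an in-memory list (made-up sample lines, not the asker's file) and keeps the keys visible instead of discarding the False groups.

```python
from itertools import groupby

# Illustrative input: two numbered blocks, a double blank line, then a tail.
lines = ["1. a\n", "b\n", "\n", "2. c\n", "\n", "\n", "tail\n"]

pairs = [
    (has_text, list(group))
    for has_text, group in groupby(lines, key=lambda line: bool(line.strip()))
]

for has_text, group in pairs:
    print(has_text, group)
```

A run of several blank lines collapses into a single False group, so the approach does not depend on exactly one blank line between blocks.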

Since all of the sections you want to keep start with a digit:

from itertools import groupby, takewhile

with open("in.txt") as f:
    grps = groupby(f, key=lambda x: bool(x.strip()))
    print (list(takewhile(lambda x: x[0][0].isdigit(),(list(v) for k,v in grps if k))))

Output:

[['1. Sometext\n', 'Sometext\n', 'Sometext\n'],
 ['2. Sometext\n', 'Sometext\n', 'Sometext\n'],
 ['3. Sometext\n', 'Sometext\n', 'Sometext\n']]
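One thing to keep in mind with takewhile is that it stops at the first group that fails the test, so a numbered block that appears after some unwanted text would be lost. If that can happen in the input, a plain filter may be a better fit. The sketch below compares the two, using io.StringIO in place of the file. The helper names blocks and starts_with_digit, and the TEXT sample, are hypothetical.

```python
import io
from itertools import groupby, takewhile

# Illustrative input: a numbered block, unwanted text, then another numbered block.
TEXT = "1. a\nb\n\nnot needed\n\n2. c\n"


def blocks(lines):
    """Yield each run of non-blank lines as a list."""
    grouped = groupby(lines, key=lambda line: bool(line.strip()))
    return (list(group) for has_text, group in grouped if has_text)


def starts_with_digit(block):
    """True if the first line of the block begins with a digit."""
    return block[0][0].isdigit()


# takewhile stops as soon as one block does not start with a digit.
leading = list(takewhile(starts_with_digit, blocks(io.StringIO(TEXT))))

# A filter keeps every numbered block, wherever it appears.
every = [block for block in blocks(io.StringIO(TEXT)) if starts_with_digit(block)]

print(leading)
print(every)
```

For the file in the question both variants give the same result, since the unwanted text only appears at the end.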

If you know there are n groups, you can slice:

from itertools import groupby, islice
with open("in.txt") as f:
    grps = groupby(f, key=lambda x: bool(x.strip()))
    print (list(islice((list(v) for k,v in grps if k),3)))

Output:

[['1. Sometext\n', 'Sometext\n', 'Sometext\n'],
 ['2. Sometext\n', 'Sometext\n', 'Sometext\n'],
 ['3. Sometext\n', 'Sometext\n', 'Sometext\n']]