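I have an input file, and I need to extract groups of lines from it, using two consecutive newlines (a blank line) as the boundary.
For example, the text file looks like this: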
1. Sometext
Sometext
Sometext

2. Sometext
Sometext
Sometext

3. Sometext
Sometext
Sometext

Sometext which is not needed
Sometext which is not needed
Sometext which is not needed
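I need to extract one substring running from "1." up to just before "2.", a second substring from "2." up to just before "3.", and so on, based on the numbers. The script below produces output, but it also picks up all the "not needed" text, which I do not want. Here is the code: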
with open("filename", "r") as f:
    content = f.read()
size = len(content)
start = 0
a = 1
b = 2
end = 0
while start < size:
    if end != -1:
        # slice from "a." up to the newline that precedes "b."
        subString = content[content.find(str(a) + "."):content.find("\n" + str(b) + ".")]
        print(subString)
        end = content.find(str(b) + ".", start)
        print("\n")
        a += 1  # increment to find the next start number
        b += 1  # increment to find the next end number
        start = end + 1  # continue searching after the last match
    else:
        break
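So I decided to use two consecutive newlines as the end position, with the line below, but it did not work.
subString = content[content.find(str(a) + ".") + 3:content.find("\n\n")]
Please let me know if anything is unclear. Thanks in advance.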
Answer 0 (score: 0)
I am not sure I understood your question correctly, but the code at the end of this answer produces the following output:
['Sometext', 'Sometext', 'Sometext']
['Sometext', 'Sometext', 'Sometext']
['Sometext', 'Sometext', 'Sometext']
for the text in your question. If you instead want the whole substring from "1." up to "2.", like this:
['1. Sometext\nSometext\nSometext']
['2. Sometext\nSometext\nSometext']
['3. Sometext\nSometext\nSometext']
then change the if statement to:
if is_number(i[0]):
    substring = []
    substring.append(i)
    print(substring)
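Otherwise, use the code below as is:
def is_number(string):
    """Return True if the string can be parsed as a number."""
    try:
        float(string)
        return True
    except ValueError:
        return False

with open('testing.txt', 'r') as f:
    # a blank line (two consecutive newlines) separates the blocks
    content = f.read().split('\n\n')

for i in content:
    # skip empty chunks and blocks that do not start with a digit
    if i and is_number(i[0]):
        c = i.split('\n')
        # drop the leading "N. " prefix from numbered lines
        substring = [line[3:] if line and is_number(line[0]) else line for line in c]
        print(substring)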
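As a side note on why the `content.find("\n\n")` attempt in the question fails: `str.find` returns the first match in the whole string unless it is given a start offset, so every block after the first gets an end position that lies before its start. Below is a minimal sketch of the same find-based idea with the offset passed in. The function name `extract_numbered_blocks` and the sample string are made up for illustration, and the sketch assumes that blocks are separated by exactly one blank line and that "N." does not also occur inside the body text.

```python
def extract_numbered_blocks(content):
    """Return the blocks "1.", "2.", ... each cut off at the next blank line."""
    blocks = []
    number = 1
    pos = 0  # offset where the next search begins
    while True:
        start = content.find(f"{number}.", pos)
        if start == -1:
            break  # no further numbered block
        # Look for the blank line *after* start, not from the top of the text.
        end = content.find("\n\n", start)
        if end == -1:
            end = len(content)  # last block runs to the end of the text
        blocks.append(content[start:end])
        pos = end
        number += 1
    return blocks


# Hypothetical sample mirroring the file in the question.
sample = (
    "1. Sometext\nSometext\nSometext\n\n"
    "2. Sometext\nSometext\nSometext\n\n"
    "3. Sometext\nSometext\nSometext\n\n"
    "Sometext which is not needed\nSometext which is not needed\n"
)
for block in extract_numbered_blocks(sample):
    print(repr(block))
```

Because the search for "4." finds nothing, the loop stops before the trailing unnumbered text is ever sliced.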
Answer 1 (score: 0)
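You will still have to filter out the unwanted lines at the end, but this gets you the groups you want:
from itertools import groupby

with open("in.txt") as f:
    # key is True for lines with text and False for blank lines
    grps = groupby(f, key=lambda x: bool(x.strip()))
    print([list(v) for k, v in grps if k])
Output: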
[['1. Sometext\n', 'Sometext\n', 'Sometext\n'], ['2. Sometext\n', 'Sometext\n', 'Sometext\n'], ['3. Sometext\n', 'Sometext\n', 'Sometext\n'], ['Sometext which is not needed\n', 'Sometext which is not needed\n', 'Sometext which is not needed']]
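Since all the sections you want to keep start with a number:
from itertools import groupby, takewhile

with open("in.txt") as f:
    grps = groupby(f, key=lambda x: bool(x.strip()))
    # stop at the first group whose first line does not begin with a digit
    print(list(takewhile(lambda x: x[0][0].isdigit(), (list(v) for k, v in grps if k))))
Output: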
[['1. Sometext\n', 'Sometext\n', 'Sometext\n'],
['2. Sometext\n', 'Sometext\n', 'Sometext\n'],
['3. Sometext\n', 'Sometext\n', 'Sometext\n']]
If you know there are n groups, you can slice (here n is 3):
from itertools import groupby, islice

with open("in.txt") as f:
    grps = groupby(f, key=lambda x: bool(x.strip()))
    # keep only the first 3 non-blank groups
    print(list(islice((list(v) for k, v in grps if k), 3)))
Output:
[['1. Sometext\n', 'Sometext\n', 'Sometext\n'],
['2. Sometext\n', 'Sometext\n', 'Sometext\n'],
['3. Sometext\n', 'Sometext\n', 'Sometext\n']]
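If you would rather get each group back as a single string, as the question asks, the same `groupby` idea can be wrapped in a small generator. This is only a sketch: the name `numbered_groups` and the `NUMBERED` pattern are made up for illustration, and it assumes the wanted groups begin with digits followed by a dot. Unlike `takewhile`, it filters every group, so a numbered group that appears after an unnumbered one is still kept.

```python
import io
import re
from itertools import groupby

# Assumption: a wanted group begins with "<digits>." at the start of its first line.
NUMBERED = re.compile(r"\d+\.")


def numbered_groups(lines):
    """Yield each blank-line-separated group whose first line starts with a number."""
    for has_text, group in groupby(lines, key=lambda line: bool(line.strip())):
        if not has_text:
            continue  # a run of blank lines, i.e. a separator
        group = [line.rstrip("\n") for line in group]
        if NUMBERED.match(group[0]):
            yield "\n".join(group)


# io.StringIO stands in for the open("in.txt") file object used above.
sample = io.StringIO(
    "1. Sometext\nSometext\nSometext\n\n"
    "2. Sometext\nSometext\nSometext\n\n"
    "3. Sometext\nSometext\nSometext\n\n"
    "Sometext which is not needed\nSometext which is not needed\n"
)
result = list(numbered_groups(sample))
print(result)
```

Passing an open file object instead of the `StringIO` works the same way, since both yield one line at a time.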