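How to extract subsets of data from a text data file in Python

Time: 2017-09-27 01:55:23

Tags: python file

I have a text file in which the relevant data (rows x columns) only appears between the "start" and "end" keywords; see below. I want to write code that can extract these subsets of data. If a line begins with "start" and is followed by data, but there is no subsequent "end" keyword, I want to ignore that data. In the example below, data1 and data3 are relevant, but data2 is not, because it is not enclosed by the "start" and "end" keywords.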

start
data1 (matrix of data) /relevant because data1 is enclosed by "start" and "end"
end
start
data2 (matrix of data) /not relevant because there is no "end"
. 
start
data3 (matrix of data) /relevant for same reason as for data1
end
.
.
and so on

I thought I could start with:

with open(file_path,'r') as file:

    text = file.readlines()
    start_indexes = []
    end_indexes = []

    for i, line in enumerate(text):
        if line.startswith('start'):
            start_indexes.append(i)
        elif line.startswith('end'):
            end_indexes.append(i)

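    for i in range(len(start_indexes)):
        # the last start has no next start, so use the end of the file as its bound
        next_start = start_indexes[i + 1] if i + 1 < len(start_indexes) else len(text)
        for j in range(len(end_indexes)):
            if start_indexes[i] < end_indexes[j] < next_start:
                print(start_indexes[i], end_indexes[j])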

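The code above gives the start line number and the end line number of each block of relevant data. This is where I am a bit stuck. How do I now extract the relevant data? In the example above, that would be data1 and data3. Am I approaching the problem the "right" way? Should I resort to pandas? Is there a more efficient and more direct way?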

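4 Answers:

Answer 0: (score: 0)

A nested loop?

You are going through every combination of start and end indexes. You only need the ones that correspond to the same piece of data.

Replace your for loops with the following: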

for start, end in zip(start_indexes, end_indexes):
    print(text[start + 1:end])

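zip(a, b, ...) returns an iterator (a list in Python 2) of tuples, essentially [(a[0], b[0], ...), (a[1], b[1], ...), ...]. You walk through each pair from start_indexes and end_indexes, which gives you the corresponding start and end values, and then use list slicing to get the data on those lines. Note that zip pairs entries purely by position, so this only lines up when every start has a matching end.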
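
If some start lines never get an end (as with data2 in the question), one option is to pair each end with the most recent start while scanning the lines. This is a minimal sketch rather than part of the original answer; the helper name extract_blocks and the shortened sample lines are made up for illustration:

```python
def extract_blocks(lines):
    """Return the data lines of every start...end block in `lines`."""
    blocks = []
    start = None  # index of the most recent unmatched 'start', if any
    for i, line in enumerate(lines):
        if line.startswith('start'):
            # a newer 'start' supersedes an earlier one that never got an 'end'
            start = i
        elif line.startswith('end') and start is not None:
            # slice out everything strictly between 'start' and 'end'
            blocks.append(lines[start + 1:i])
            start = None
    return blocks


# hypothetical, shortened version of the question's sample file
sample = [
    'start\n', 'data1\n', 'end\n',
    'start\n', 'data2\n', '.\n',
    'start\n', 'data3\n', 'end\n',
]
blocks = extract_blocks(sample)
print(blocks)
```

With real input, the list returned by file.readlines() in the question could be passed in place of sample.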

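Answer 1: (score: 0)

I would go about it a different way, reading the file sequentially (this assumes the data of a single "start"-"end" block is not too large). I would create a read variable to collect the data of the current block (whether it is relevant or not) and a readline variable that holds the state transitions.

Some pseudo-Python: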

io.open
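
The sequential approach described in this answer could look roughly like the following. This is a sketch under assumptions, not the answerer's own code: the function name iter_blocks and the variable names inside (the state) and current (the collected data) are invented here, and io.StringIO stands in for a real file object.

```python
import io


def iter_blocks(lines):
    """Yield the data lines of each start...end block, one block at a time."""
    inside = False   # state: are we between a 'start' and its 'end'?
    current = []     # data collected for the current block
    for line in lines:
        if line.startswith('start'):
            # entering a block; anything collected without an 'end' is dropped
            inside = True
            current = []
        elif line.startswith('end'):
            if inside:
                yield current
            inside = False
            current = []
        elif inside:
            current.append(line.rstrip('\n'))


# io.StringIO stands in for a real file, e.g. open(file_path)
demo = io.StringIO('start\n1 2\n3 4\nend\nstart\n5 6\n.\nstart\n7 8\nend\n.\n')
blocks = list(iter_blocks(demo))
print(blocks)
```

Because the function is a generator, only one block is held in memory at a time, which is the point of reading the file line by line.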

Answer 2: (score: 0)

Personally, I think regular expressions are the best way to handle this kind of situation:

import re
woof0='''start
data1 (matrix of data) /relevant because data1 is enclosed by "start" and "end"
end
start
data2 (matrix of data) /not relevant because there is no "end"
.
start
data3 (matrix of data) /relevant for same reason as for data1
end
.
.
and so on
'''
re.findall(r'start(\sdata.*|\Sdata.*)\nend',woof0)

Output:

['\ndata1 (matrix of data) /relevant because data1 is enclosed by "start" and "end"', '\ndata3 (matrix of data) /relevant for same reason as for data1']
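
The pattern above matches a single data line that begins with "data". If a block holds several rows, a variation along the following lines might work. It is a sketch under assumptions: the sample text is invented, and any line beginning with "start" or "end" is treated as a keyword, just as startswith does in the question.

```python
import re

# hypothetical sample with a multi-row block
text = '''start
data1 row 1
data1 row 2
end
start
data2 row 1
.
start
data3 row 1
end
.
'''

# ^start[^\n]*\n   a line beginning with 'start'
# (?: ... )*       any number of lines that begin with neither keyword
# end              the 'end' keyword, which here always sits at a line start
pattern = re.compile(
    r'^start[^\n]*\n((?:(?!start|end)[^\n]*\n)*)end',
    re.MULTILINE,
)

blocks = [body.splitlines() for body in pattern.findall(text)]
print(blocks)
```

The negative lookahead (?!start|end) stops the match from running past a second "start", so a block without an "end" is skipped rather than merged into the next one.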

Answer 3: (score: 0)

Setup:

s = '''start
data1 (matrix of data) /relevant because data1 is enclosed by "start" and "end"
end
start
data2 (matrix of data) /not relevant because there is no "end"
start
data3 (matrix of data) /relevant for same reason as for data1
end
start
data4 blah
'''
import io
f = io.StringIO(s)

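Iterate over the file, testing what each line starts with; work out the logic needed to put the valid data blocks into sublists and append them to the result list...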

result = []
sub = None  # None means we are not inside a block

for line in f:
    if line.startswith('start'):
        # new data block; a previous block that never
        # reached an end is discarded here
        sub = []
    elif line.startswith('end'):
        if sub is not None:
            # start --> data --> end: a complete block
            result.append(sub)
        sub = None
    elif sub is not None:
        # any other line inside a block is data
        sub.append(line)

# a block still open at EOF has no end, so it is not appended