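How do I keep only the rows with the highest value when several rows fall within a time window?

Time: 2016-07-05 09:51:32

Tags: python rows

I'm new to Python and to scripting, so I'd really appreciate some guidance on writing a Python script. So, to the point:

I have a large number of files in a directory. Some of the files are empty, and the others contain lines like these: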

16 2009-09-30T20:07:59.659Z 0.05 0.27 13.559 6
16 2009-09-30T20:08:49.409Z 0.22 0.312 15.691 7
16 2009-09-30T20:12:17.409Z -0.09 0.235 11.826 4
16 2009-09-30T20:12:51.159Z 0.15 0.249 12.513 6
16 2009-09-30T20:15:57.209Z 0.16 0.234 11.776 4
16 2009-09-30T20:21:17.109Z 0.38 0.303 15.201 6
16 2009-09-30T20:23:47.959Z 0.07 0.259 13.008 5
16 2009-09-30T20:32:10.109Z 0.0 0.283 14.195 5
16 2009-09-30T20:32:10.309Z 0.0 0.239 12.009 5
16 2009-09-30T20:37:48.609Z -0.02 0.256 12.861 4
16 2009-09-30T20:44:19.359Z 0.14 0.251 12.597 4
16 2009-09-30T20:48:39.759Z 0.03 0.284 14.244 5
16 2009-09-30T20:49:36.159Z -0.07 0.278 13.98 4
16 2009-09-30T20:57:54.609Z 0.01 0.304 15.294 4
16 2009-09-30T20:59:47.759Z 0.27 0.265 13.333 4
16 2009-09-30T21:02:56.209Z 0.28 0.272 13.645 6

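And so on.

I want to put the lines from these files into a new file, but with one condition: if two or more consecutive lines fall within a 6-second time window, only the line with the highest threshold (the fifth column) should be written to the new file.

So, something like this:

Original: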
16 2009-09-30T20:32:10.109Z 0.0 0.283 14.195 5
16 2009-09-30T20:32:10.309Z 0.0 0.239 12.009 5

In the output file:
16 2009-09-30T20:32:10.109Z 0.0 0.283 14.195 5

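Keep in mind that lines from one file may fall within 6 seconds of lines from other files, so the line that ends up in the output should be the one with the highest threshold across all of the files.

Here is the code that parses the contents of a line: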

import glob
from datetime import datetime

path = './*.cat'
files = glob.glob(path)

# open the output once, so it is not truncated again for every input file
with open("times_final", "w") as out_file:
    for file in files:
        # the with-block closes each input file as soon as it has been read
        with open(file, 'r') as in_file:
            for line in in_file:
                split_line = line.split()
                if not split_line:
                    continue  # skip blank lines
                template_number = split_line[0]
                t = datetime.strptime(split_line[1], '%Y-%m-%dT%H:%M:%S.%fZ')
                mag = split_line[2]
                num = split_line[3]
                threshold = float(split_line[4])
                no_detections = split_line[5]

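Any hints or guidance would be much appreciated.

1 Answer:

Answer 0: (score: 0)

You said in the comments that you know how to merge the files into a single file sorted by t, and that the 6-second window starts at the first row and is based on the actual data.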
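
If that merged file does not exist yet, one way to build it is sketched below. The helper name `merge_sorted` is made up for illustration. The sketch assumes every data line has the six whitespace-separated fields shown in the question, and that all rows fit in memory at once.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def merge_sorted(paths, out_path):
    """Hypothetical helper: collect the rows of all files, sorted by timestamp."""
    rows = []
    for path in paths:
        with open(path) as f_in:
            for line in f_in:
                fields = line.split()
                if len(fields) != 6:
                    continue  # skips blank lines (and empty files yield nothing)
                t = datetime.strptime(fields[1], TIME_FORMAT)
                rows.append((t, " ".join(fields)))
    # sort on the timestamp only; rows with equal timestamps keep their read order
    rows.sort(key=lambda pair: pair[0])
    with open(out_path, "w") as f_out:
        for _, line in rows:
            f_out.write(line + "\n")
    return len(rows)
```

With `import glob`, calling `merge_sorted(glob.glob("./*.cat"), "master_data")` would produce a `master_data` file of the kind the sample implementation reads.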

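What remains is a way to remember the maximum threshold for each window, and to write a row only once you are sure that every row in that window has been processed. Sample implementation: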

from datetime import datetime, timedelta
from csv import DictReader, DictWriter

fieldnames=("template_number", "t", "mag","num", "threshold", "no_detections")
with open('master_data') as f_in, open("times_final", "w") as f_out:
    reader = DictReader(f_in, delimiter=" ", fieldnames=fieldnames)
    writer = DictWriter(f_out, delimiter=" ", fieldnames=fieldnames,
                        lineterminator="\n")
    window_start = datetime(1900, 1, 1)
    window_timedelta = timedelta(seconds=6)
    window_max = 0
    window_row = None
    for row in reader:
        try:
            t = datetime.strptime(row["t"], "%Y-%m-%dT%H:%M:%S.%fZ")
            threshold = float(row["threshold"])
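        except (ValueError, TypeError):
            # replace by actual error handling
            # (TypeError covers short rows, where DictReader fills in None)
            print("Problem with: {}".format(row))
            # skip this row: t and threshold were not set for it
            continue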
        # switch to new window after 6 seconds
        if t - window_start > window_timedelta:
            # write out previous window before switching
            if window_row:
                writer.writerow(window_row)
            window_start = t
            window_max = threshold
            window_row = row
        # remember max threshold inside a single window
        elif threshold > window_max:
            window_max = threshold
            window_row = row
    # don't forget the last window
    if window_row:
        writer.writerow(window_row)
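
To check the window logic without touching any files, the same idea can be written as a small function over a list of lines. This is only a sketch: the name `best_per_window` is hypothetical, it uses `None` rather than a 1900 sentinel date to mark "no window yet", and it assumes the lines are already sorted by time.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def best_per_window(lines, window_seconds=6):
    """Hypothetical helper: keep the highest-threshold line of each window."""
    window = timedelta(seconds=window_seconds)
    window_start = None
    best_line = None
    best_threshold = None
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # ignore blank or malformed lines
        t = datetime.strptime(fields[1], TIME_FORMAT)
        threshold = float(fields[4])
        if window_start is None or t - window_start > window:
            # a new window begins: flush the winner of the previous one
            if best_line is not None:
                kept.append(best_line)
            window_start, best_threshold, best_line = t, threshold, line
        elif threshold > best_threshold:
            # same window, higher threshold: this line is the new candidate
            best_threshold, best_line = threshold, line
    # don't forget the last window
    if best_line is not None:
        kept.append(best_line)
    return kept


sample = [
    "16 2009-09-30T20:23:47.959Z 0.07 0.259 13.008 5",
    "16 2009-09-30T20:32:10.109Z 0.0 0.283 14.195 5",
    "16 2009-09-30T20:32:10.309Z 0.0 0.239 12.009 5",
    "16 2009-09-30T20:37:48.609Z -0.02 0.256 12.861 4",
]
for line in best_per_window(sample):
    print(line)
```

For these four rows from the question, the 20:32:10.309Z row is dropped because it lies 0.2 seconds after the 20:32:10.109Z row and has the lower threshold (12.009 versus 14.195); the other three rows are kept.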