Question

I have a dataset like this:

data = ['input2-2018-09-01--22-35-43', 'input1-2018-09-01--22-35-41', 'input1-2018-09-01--22-35-42', 'input1-2018-09-01--20-38-11', 'input1-2018-09-01--22-32-38']

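My goal is to extract the entries that correspond to the same time, using a time threshold of 2 seconds. My real dataset is much larger than this one (about 300 elements), so I use itertools.groupby to group the entries into time intervals and then extract the groups with length > 1. My code (ready to run) is: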

from itertools import groupby
from datetime import timedelta, datetime


data = ['input2-2018-09-01--22-35-43', 'input1-2018-09-01--22-35-41',
        'input1-2018-09-01--22-35-42', 'input1-2018-09-01--20-38-11',
        'input1-2018-09-01--22-32-38']

time_threshold = 2   # seconds
date_time = '2018-09-01'

def time_comparison(data, time_threshold):
    potential_detections = []

    # Make groups by time_threshold intervals
    def get_key(det):
        d = datetime.strptime(det[det.find('--')-len(date_time):],'%Y-%m-%d--%H-%M-%S')
        k = d + timedelta(seconds=-(d.second % time_threshold))
        return datetime(k.year, k.month, k.day, k.hour, k.minute, k.second)

    group = groupby(sorted(data), key=get_key)
    print(f'-------------{date_time}------------')

    # Iterate and extract coincidences
    for key, items in group:
        time_interval = []
        print('--------------------')
        print(key)
        print('---')
        for item in items:
            print(item)
            time_interval.append(item)
        # Record the group once, after all of its items have been collected
        if len(time_interval) > 1:
            potential_detections.append(time_interval)

    return potential_detections

time_comparison(data, time_threshold)

The output is:

-------------2018-09-01------------
--------------------
2018-09-01 20:38:10
---
input1-2018-09-01--20-38-11
--------------------
2018-09-01 22:32:38
---
input1-2018-09-01--22-32-38
--------------------
2018-09-01 22:35:42
---
input1-2018-09-01--22-35-42
input2-2018-09-01--22-35-43
--------------------
2018-09-01 22:35:40
---
input1-2018-09-01--22-35-41

The problem is that, with my 2-second threshold, the last two keys should be merged into a single interval:

2018-09-01 22:35:41
---
input1-2018-09-01--22-35-42
input2-2018-09-01--22-35-43
input1-2018-09-01--22-35-41

How can I fix this? Is this the right way to classify my data?

Thank you very much.
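
For reference, the split seen in the output follows from how get_key rounds: each timestamp is floored to a multiple of time_threshold seconds, so the bucket edges are fixed and two timestamps one second apart can still land in different buckets. A minimal sketch of that arithmetic, using a hypothetical helper bucket_start that mirrors the key computation in the question (it takes only the timestamp part of each string):

```python
from datetime import datetime, timedelta

time_threshold = 2  # seconds, as in the question


def bucket_start(stamp):
    # Hypothetical helper: floor the seconds field to a multiple of time_threshold,
    # which is what get_key does in the question's code.
    d = datetime.strptime(stamp, '%Y-%m-%d--%H-%M-%S')
    return d - timedelta(seconds=d.second % time_threshold)


for stamp in ['2018-09-01--22-35-41', '2018-09-01--22-35-42', '2018-09-01--22-35-43']:
    print(stamp, '->', bucket_start(stamp).time())
# 2018-09-01--22-35-41 -> 22:35:40
# 2018-09-01--22-35-42 -> 22:35:42
# 2018-09-01--22-35-43 -> 22:35:42
```

The entry at 22:35:41 falls into the 22:35:40 bucket while its neighbours fall into the 22:35:42 bucket, which is why the three entries are never grouped together.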

Answer 1

You can use a class to wrap each input date for better sorting:

import itertools, re, datetime

class _date:
  def __init__(self, _date:str) -> None:
    # Split 'inputN-YYYY-MM-DD--HH-MM-SS' into its date part and its time part
    self.dmy, self.hms = re.findall(r'(?<=^input\d\-)[\-\d]+(?=\-\-)|(?<=\-\-)[\d\-]+$', _date)
    self._original = _date
  @property
  def to_date(self):
    return datetime.datetime(*map(int, self.dmy.split('-')+self.hms.split('-')))
  @staticmethod
  def lt(a, b):
    # True when the two timestamps are at most 2 seconds apart (identical timestamps included)
    return abs(a.to_date - b.to_date).total_seconds() <= 2
  # Both comparisons mean "within 2 seconds": sorted() uses __lt__, groupby() uses __eq__.
  # This is not a true ordering, so it relies on close timestamps ending up adjacent after sorting.
  def __eq__(self, _c):
    return _date.lt(self, _c)
  def __lt__(self, _d):
    return _date.lt(self, _d)
  def __repr__(self):
    return f'<{self.dmy}, {self.hms}>'


data = ['input2-2018-09-01--22-35-43', 'input1-2018-09-01--22-35-41', 'input1-2018-09-01--22-35-42', 'input1-2018-09-01--20-38-11', 'input1-2018-09-01--22-32-38']
# First group the raw strings by calendar date
new_data = [[re.findall(r'(?<=^input\d\-)[\d\-]+(?=\-\-)', i)[0], i] for i in data]
final_data = [[a, list(b)] for a, b in itertools.groupby(sorted(new_data, key=lambda x:x[0]), key=lambda x:x[0])]
for a, b in final_data:
  print(f'{"-"*10}{a}{"-"*10}')
  # Within each date, wrap the entries and group those that compare as equal (within 2 seconds)
  new_data = sorted([_date(i) for _, i in b], reverse=True)
  final_results = [[a, list(b)] for a, b in itertools.groupby(new_data)]
  for _, c in final_results:
    print('*'*20)
    print('\n'.join(i._original for i in c))
    print('*'*20)

Output:

----------2018-09-01----------
********************
input2-2018-09-01--22-35-43
input1-2018-09-01--22-35-41
input1-2018-09-01--22-35-42
********************
********************
input1-2018-09-01--20-38-11
********************
********************
input1-2018-09-01--22-32-38
********************
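
A simpler alternative, offered only as a sketch and not as part of the answer above, is to sort the entries by their parsed timestamps and sweep through them once, starting a new group whenever an entry is more than the threshold away from the first entry of the current group. The helper names parse_time and group_by_window are hypothetical, and the sketch assumes every entry ends with a 20-character 'YYYY-MM-DD--HH-MM-SS' timestamp:

```python
from datetime import datetime, timedelta


def parse_time(item):
    # Assumption: the last 20 characters are 'YYYY-MM-DD--HH-MM-SS'.
    return datetime.strptime(item[-20:], '%Y-%m-%d--%H-%M-%S')


def group_by_window(data, threshold_seconds=2):
    """Cluster items whose times lie within threshold_seconds of the
    first (earliest) item in the cluster."""
    window = timedelta(seconds=threshold_seconds)
    groups = []  # list of (start_time, items) pairs
    # Sort by the parsed time, not by the raw string, so that 'input2-...'
    # entries are interleaved correctly with 'input1-...' entries.
    for item in sorted(data, key=parse_time):
        t = parse_time(item)
        if groups and t - groups[-1][0] <= window:
            groups[-1][1].append(item)
        else:
            groups.append((t, [item]))
    return [items for _, items in groups]


data = ['input2-2018-09-01--22-35-43', 'input1-2018-09-01--22-35-41',
        'input1-2018-09-01--22-35-42', 'input1-2018-09-01--20-38-11',
        'input1-2018-09-01--22-32-38']

coincidences = [g for g in group_by_window(data) if len(g) > 1]
print(coincidences)
# [['input1-2018-09-01--22-35-41', 'input1-2018-09-01--22-35-42', 'input2-2018-09-01--22-35-43']]
```

Measuring from the first entry of each group keeps every group at most threshold_seconds wide. Comparing each entry with the previous one instead would chain neighbours together, so a group could then span more than the threshold; which behaviour is wanted depends on how "same time" is defined for the data.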
