I have a dataset like this:
data = ['input2-2018-09-01--22-35-43', 'input1-2018-09-01--22-35-41', 'input1-2018-09-01--22-35-42', 'input1-2018-09-01--20-38-11', 'input1-2018-09-01--22-32-38']
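My goal is to extract the entries that correspond to the same time, using a time threshold of 2 seconds. My real dataset is much larger than this one (about 300 elements), so I use itertools.groupby to group the entries into time intervals and keep only the groups whose length is > 1. My code (ready to run) is: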
from itertools import groupby
from datetime import timedelta, datetime
data = ['input2-2018-09-01--22-35-43', 'input1-2018-09-01--22-35-41',
'input1-2018-09-01--22-35-42', 'input1-2018-09-01--20-38-11',
'input1-2018-09-01--22-32-38']
time_threshold = 2 # seconds
date_time = '2018-09-01'
def time_comparison(data, time_threshold):
    potential_detections = []

    # Make groups by time_threshold intervals
    def get_key(det):
        d = datetime.strptime(det[det.find('--')-len(date_time):], '%Y-%m-%d--%H-%M-%S')
        k = d + timedelta(seconds=-(d.second % time_threshold))
        return datetime(k.year, k.month, k.day, k.hour, k.minute, k.second)

    group = groupby(sorted(data), key=get_key)
    print(f'-------------{date_time}------------')

    # Iterate and extract coincidences
    for key, items in group:
        time_interval = []
        print('--------------------')
        print(key)
        print('---')
        for item in items:
            print(item)
            time_interval.append(item)
        if len(time_interval) > 1:
            potential_detections.append(time_interval)
    return potential_detections

time_comparison(data, time_threshold)
The output is:
-------------2018-09-01------------
--------------------
2018-09-01 20:38:10
---
input1-2018-09-01--20-38-11
--------------------
2018-09-01 22:32:38
---
input1-2018-09-01--22-32-38
--------------------
2018-09-01 22:35:42
---
input1-2018-09-01--22-35-42
input2-2018-09-01--22-35-43
--------------------
2018-09-01 22:35:40
---
input1-2018-09-01--22-35-41
The problem is that, given my 2-second threshold, the last two keys should be merged into a single interval:
2018-09-01 22:35:41
---
input1-2018-09-01--22-35-42
input2-2018-09-01--22-35-43
input1-2018-09-01--22-35-41
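A quick way to see where the split comes from, assuming the same flooring rule as get_key above (bucket_start is a hypothetical helper name used only for this illustration): each timestamp is rounded down to the start of a fixed 2-second bucket, so second 41 falls into the bucket starting at 40 while seconds 42 and 43 fall into the bucket starting at 42.

```python
from datetime import datetime, timedelta

time_threshold = 2  # seconds, as in the question


def bucket_start(stamp):
    """Floor a 'YYYY-MM-DD--HH-MM-SS' stamp to its fixed bucket start."""
    d = datetime.strptime(stamp, '%Y-%m-%d--%H-%M-%S')
    return d - timedelta(seconds=d.second % time_threshold)


for stamp in ('2018-09-01--22-35-41',
              '2018-09-01--22-35-42',
              '2018-09-01--22-35-43'):
    # Neighbouring seconds can land on either side of a bucket boundary.
    print(stamp, '->', bucket_start(stamp))
```

Timestamps only 1 second apart can therefore receive different keys, which is why groupby reports them as separate intervals.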
How can I fix this? Is this the right way to classify my data?
Thank you very much.
Answer 0 (score: -1)
You can use a class to wrap each input date for better sorting:
import itertools, re, datetime
import functools

class _date:
    times = ['years', 'months', 'days', 'hours', 'minutes', 'seconds']

    def __init__(self, _date:str) -> None:
        # Split 'inputN-YYYY-MM-DD--HH-MM-SS' into its date part and its time part
        self.dmy, self.hms = re.findall(r'(?<=^input\d\-)[\-\d]+(?=\-\-)|(?<=\-\-)[\d\-]+$', _date)
        self._original = _date

    @property
    def to_date(self):
        return datetime.datetime(*map(int, self.dmy.split('-')+self.hms.split('-')))

    @staticmethod
    def lt(a, b):
        # Absolute difference between the two timestamps, as a timedelta
        r = a.to_date-b.to_date if a.to_date > b.to_date else b.to_date-a.to_date
        # False when the difference has both a days and a seconds part, or no seconds part
        if sum(bool(getattr(r, i, 0)) for i in _date.times) > 1 or not getattr(r, 'seconds', None):
            return False
        return r.seconds <= 2

    # Both comparisons delegate to lt, so timestamps 1 or 2 seconds apart compare as equal
    def __eq__(self, _c):
        return _date.lt(self, _c)

    def __lt__(self, _d):
        return _date.lt(self, _d)

    def __repr__(self):
        return f'<{self.dmy}, {self.hms}>'

data = ['input2-2018-09-01--22-35-43', 'input1-2018-09-01--22-35-41', 'input1-2018-09-01--22-35-42', 'input1-2018-09-01--20-38-11', 'input1-2018-09-01--22-32-38']
# Group the raw strings by calendar date first
new_data = [[re.findall(r'(?<=^input\d\-)[\d\-]+(?=\-\-)', i)[0], i] for i in data]
final_data = [[a, list(b)] for a, b in itertools.groupby(sorted(new_data, key=lambda x:x[0]), key=lambda x:x[0])]
for a, b in final_data:
    print(f'{"-"*10}{a}{"-"*10}')
    # Within each date, group the wrapped entries using the comparisons defined above
    new_data = sorted([_date(i) for _, i in b], reverse=True)
    final_results = [[a, list(b)] for a, b in itertools.groupby(new_data)]
    for _, c in final_results:
        print('*'*20)
        print('\n'.join(i._original for i in c))
        print('*'*20)
Output:
----------2018-09-01----------
********************
input2-2018-09-01--22-35-43
input1-2018-09-01--22-35-41
input1-2018-09-01--22-35-42
********************
********************
input1-2018-09-01--20-38-11
********************
********************
input1-2018-09-01--22-32-38
********************
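An alternative that avoids fixed buckets, offered as a minimal sketch and not as the method used in the answer above: sort the entries by timestamp and start a new group whenever the gap to the previous entry exceeds the threshold. The helper names parse_timestamp and group_by_gap are hypothetical, and the sketch assumes every name has the form <label>-YYYY-MM-DD--HH-MM-SS.

```python
from datetime import datetime, timedelta


def parse_timestamp(name):
    """Parse the timestamp from '<label>-YYYY-MM-DD--HH-MM-SS' (assumed format)."""
    _, stamp = name.split('-', 1)  # drop the leading label, e.g. 'input1'
    return datetime.strptime(stamp, '%Y-%m-%d--%H-%M-%S')


def group_by_gap(names, threshold_seconds=2):
    """Group names whose timestamps are within threshold_seconds of the previous one."""
    max_gap = timedelta(seconds=threshold_seconds)
    groups = []
    previous = None
    for name in sorted(names, key=parse_timestamp):
        current = parse_timestamp(name)
        if previous is not None and current - previous <= max_gap:
            groups[-1].append(name)   # close enough: extend the current group
        else:
            groups.append([name])     # gap too large: start a new group
        previous = current
    return groups


data = ['input2-2018-09-01--22-35-43', 'input1-2018-09-01--22-35-41',
        'input1-2018-09-01--22-35-42', 'input1-2018-09-01--20-38-11',
        'input1-2018-09-01--22-32-38']

groups = group_by_gap(data)
# Keep only the groups that contain more than one entry
coincidences = [g for g in groups if len(g) > 1]
print(coincidences)
```

With this rule, entries chain together: a run of entries that are each within 2 seconds of their predecessor forms one group, even if the first and last are more than 2 seconds apart. If that is not wanted, comparing against the first entry of the current group instead of the previous entry would cap each group's total span.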