I have a DataFrame similar to this one, but with more than 10,000,000 rows:
import pandas as pd
data = {'timestamp': ['1970-01-01 00:27:00', '1970-01-01 00:27:10', '1970-01-01 00:27:20',
'1970-01-01 00:27:30', '1970-01-01 00:27:40', '1970-01-01 00:27:50',
'1970-01-01 00:28:00', '1970-01-01 00:28:10', '1970-01-01 00:28:20',
'1970-01-01 00:28:30', '1970-01-01 00:28:40', '1970-01-01 00:28:50'],
'label': [0, 0, 1, 1, 1, 1, 0, 0, 1 , 1, 1 ,0]}
df = pd.DataFrame(data, columns=['label'], index=data['timestamp'])
df.index = pd.to_datetime(df.index)
Index label
1970-01-01 00:27:00 0
1970-01-01 00:27:10 0
1970-01-01 00:27:20 1
1970-01-01 00:27:30 1
1970-01-01 00:27:40 1
1970-01-01 00:27:50 1
1970-01-01 00:28:00 0
1970-01-01 00:28:10 0
1970-01-01 00:28:20 1
1970-01-01 00:28:30 1
1970-01-01 00:28:40 1
1970-01-01 00:28:50 0
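The goal is to keep all rows where the 'label' column equals 0, and to keep rows where 'label' equals 1 only if that value holds without interruption over a given time range. For example, besides the 0 values, I only want to keep the rows where the label is 1 for at least 30 seconds in a row. The result should look like this: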
Index label
1970-01-01 00:27:00 0
1970-01-01 00:27:10 0
1970-01-01 00:27:20 1
1970-01-01 00:27:30 1
1970-01-01 00:27:40 1
1970-01-01 00:27:50 1
1970-01-01 00:28:00 0
1970-01-01 00:28:10 0
1970-01-01 00:28:50 0
The following code does the job, but it is impractical for a huge dataset like mine.
from datetime import timedelta
valid_range = 30
valid_df = df[df['label'] == 1].index.values.size
df_temp = df.copy()
drop_list = []
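while valid_df != 0:
    # First remaining row with label 1 and the end of its 30-second window
    begin = df_temp[df_temp['label'] == 1].index[0]
    end = begin + timedelta(seconds=valid_range)
    if df_temp['label'].loc[begin:end].nunique() == 1:
        # Window contains only 1s: keep it and continue after the window
        df_temp = df_temp.loc[df_temp.index > end]
    else:
        # Window is interrupted by a 0: mark this row for removal
        df_temp.drop(begin, axis=0, inplace=True)
        drop_list.append(begin)
    valid_df = df_temp[df_temp['label'] == 1].index.values.size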
df.drop(drop_list, axis=0, inplace=True)
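Any suggestions on how to do this better, faster, or with less memory consumption?
Edit: My DataFrame may contain time gaps and is not continuous, so I cannot use the answer suggested for this question.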
Answer 0 (score: 1)
You could try combining groupby with a filter on the grouped results:
import pandas as pd
data = {'timestamp': ['1970-01-01 00:27:00', '1970-01-01 00:27:10', '1970-01-01 00:27:20',
'1970-01-01 00:27:30', '1970-01-01 00:27:40', '1970-01-01 00:27:50',
'1970-01-01 00:28:00', '1970-01-01 00:28:10', '1970-01-01 00:28:20',
'1970-01-01 00:28:30', '1970-01-01 00:28:40', '1970-01-01 00:28:50'
],
'label': [0, 0, 1, 1, 1, 1, 0, 0, 1 , 1, 1 ,0]}
df = pd.DataFrame(data, columns=['label'], index=data['timestamp'])
df["time"] = df.index
df["time"] = pd.to_datetime(df["time"],errors='coerce')
df["delta"]= (df["time"]-df["time"].shift()).dt.total_seconds()
gp = df.groupby([(df.label != df.label.shift()).cumsum()])
rem = gp.filter(lambda g: g.delta.sum()>30)
new_df= pd.concat([rem[rem.label==1],df[df.label==0]], axis =0).sort_index()
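Because `delta` is measured from the previous row, the sum for each group also includes the gap before the group's first row, which may matter when the data has time gaps. A minimal sketch of a variant that measures each run from its first to its last timestamp is shown here. It assumes a sorted DatetimeIndex and 0/1 labels; `keep_long_runs` is a hypothetical helper name, not a pandas function. It treats a run of 1s as a whole, so on runs longer than 30 seconds it may keep rows that the loop in the question would drop.

```python
import pandas as pd

def keep_long_runs(df, column="label", min_seconds=30):
    """Keep every 0 row, plus runs of 1 whose first-to-last span is >= min_seconds.

    Hypothetical helper; assumes df has a sorted DatetimeIndex and 0/1 labels.
    """
    labels = df[column].to_numpy()
    # A new run id starts whenever the label differs from the previous row.
    run_id = (df[column] != df[column].shift()).cumsum().to_numpy()
    # Timestamps as a Series with a plain positional index.
    ts = pd.Series(df.index)
    grouped = ts.groupby(run_id)
    # Span of each run in seconds, broadcast back to every row of the run.
    span = (grouped.transform("max") - grouped.transform("min")).dt.total_seconds()
    mask = (labels == 0) | (span.to_numpy() >= min_seconds)
    return df[mask]

# Usage on the sample data from the question.
timestamps = ['1970-01-01 00:27:00', '1970-01-01 00:27:10', '1970-01-01 00:27:20',
              '1970-01-01 00:27:30', '1970-01-01 00:27:40', '1970-01-01 00:27:50',
              '1970-01-01 00:28:00', '1970-01-01 00:28:10', '1970-01-01 00:28:20',
              '1970-01-01 00:28:30', '1970-01-01 00:28:40', '1970-01-01 00:28:50']
labels = [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0]
df = pd.DataFrame({'label': labels}, index=pd.to_datetime(timestamps))
result = keep_long_runs(df, 'label', 30)
print(result)
```

On the sample, the 20-second run starting at 00:28:20 is removed and the 30-second run starting at 00:27:20 is kept, without any Python-level loop over the rows.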
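Answer 1 (score: 0)
I guess there are many ways to do this, and this is just one of them. On your sample it is noticeably faster (100 loops, best of 3: 16.3 ms per loop instead of 10 loops, best of 3: 46.6 ms per loop). You can probably optimize it further, but written this way every step is explicit.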
df['group'] = (df['label'] != df['label'].shift()).cumsum()  # id for each run of equal labels
df['first'] = df.groupby('group')['timestamp'].transform('first')  # first timestamp of each run
df['first'] = pd.to_datetime(df['first'])  # make sure the column has datetime dtype
df['duration'] = (df['timestamp'] - df['first']).dt.total_seconds()  # seconds since the run started
df['max_duration'] = df.groupby('group')['duration'].transform('last')  # total duration of the run
df[(df['max_duration'] >= 30) | (df['label'] == 0)]  # keep 0 rows and runs lasting at least 30 s
I changed the input data slightly, so that the timestamp is a regular column:
import pandas as pd
data = {'timestamp': ['1970-01-01 00:27:00', '1970-01-01 00:27:10', '1970-01-01 00:27:20',
'1970-01-01 00:27:30', '1970-01-01 00:27:40', '1970-01-01 00:27:50',
'1970-01-01 00:28:00', '1970-01-01 00:28:10', '1970-01-01 00:28:20',
'1970-01-01 00:28:30', '1970-01-01 00:28:40', '1970-01-01 00:28:50'],
'label': [0, 0, 1, 1, 1, 1, 0, 0, 1 , 1, 1 ,0]}
df = pd.DataFrame(data, columns=['timestamp', 'label', 'group', 'first'])
df['timestamp'] = pd.to_datetime(df['timestamp'])
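One detail worth checking in the duration step of this answer: on a timedelta column, `.dt.seconds` returns only the seconds component of each value (0 to 86399), while `.dt.total_seconds()` returns the full length. The two agree for runs shorter than a day and diverge for longer ones. A small sketch with made-up values, purely for illustration:

```python
import pandas as pd

# Two illustrative durations: 30 seconds, and one day plus 30 seconds.
td = pd.Series(pd.to_timedelta(['30s', '1 days 00:00:30']))

seconds_component = td.dt.seconds.tolist()      # days are ignored here
total_seconds = td.dt.total_seconds().tolist()  # full duration in seconds

print(seconds_component)  # [30, 30]
print(total_seconds)      # [30.0, 86430.0]
```

If runs of identical labels can last longer than a day, `.dt.total_seconds()` is the safer choice for the 30-second threshold.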
Answer 2 (score: 0)
I came up with a solution that works for my case. I extended the DataFrame with some "challenging" data points.
data = {'timestamp': ['1970-01-01 00:27:00', '1970-01-01 00:27:10', '1970-01-01 00:27:20',
'1970-01-01 00:27:30', '1970-01-01 00:27:40', '1970-01-01 00:27:50',
'1970-01-01 00:28:00', '1970-01-01 00:28:10', '1970-01-01 00:28:20',
'1970-01-01 00:28:30', '1970-01-01 00:28:40', '1970-01-01 00:28:50',
'1970-01-01 00:32:10', '1970-01-01 00:33:50', '1970-01-01 00:34:58',
'1970-01-01 00:34:59', '1970-01-01 00:35:20', '1970-01-01 00:35:25',
'1970-01-01 00:35:30', '1970-01-01 00:35:56', '1970-01-01 00:35:59',
'1970-01-01 00:36:24'],
'label': [0, 0, 1, 1, 1, 1, 0, 0, 1 , 1, 1 ,0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1]}
df = pd.DataFrame(data, columns=['label'], index=data['timestamp'])
df.index = pd.to_datetime(df.index)
Function:
from datetime import timedelta

def check_time_range(df, column, valid_range=30):
    # Note: this adds a 'diff' column to the DataFrame that is passed in
    df['diff'] = df[column].diff()
    # Rows where the label switches from 0 to 1
    begin_points = df.index[df['diff'] == 1].tolist()
    drop_list = []
    for begin in begin_points:
        end = begin + timedelta(seconds=valid_range)
        if not df[column].loc[begin:end].nunique() == 1 or \
                df[column][(df[column] == 1) & (df.index >= begin) & (df.index < end)].sum() <= 1:
            try:
                # Get the index where 'label' changes back to 0
                changed_back = df[(df['diff'] == -1) & (df.index >= begin)].index[0]
                index_list = df.index[(df.index >= begin) & (df.index < changed_back)].tolist()
            except IndexError:
                # The label never changes back: drop everything from 'begin' on
                index_list = df.index[(df.index >= begin)].tolist()
            drop_list.append(index_list)
    flatten_drop_list = [item for sublist in drop_list for item in sublist]
    df_new = df.drop(flatten_drop_list, axis=0)
    return df_new
Timing:
In [1]: %timeit df_new = check_time_range(df, 'label', 30)
12.8 ms ± 497 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)