I have the following DataFrame of sensor data:
Data_Digital Data_Analog Time
1 10 2015-02-01 00:00:00
1 12 2015-02-01 00:00:05
1 25 2015-02-01 07:45:07
1 25 2015-02-01 07:45:08
1 25 2015-02-01 21:45:10
0 25 2015-03-04 00:00:08
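I need to compare the "Time" at position 0 with the "Time" at position 1, and so on for each pair of consecutive rows. If the time difference between the two rows is greater than six hours, they must belong to different classes. If the difference is smaller (< 6 hours), they must belong to the same class. I need to represent this class in a new DataFrame column.
The desired output is: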
Data_Digital Data_Analog Time New_Col_Target
1 10 2015-02-01 00:00:00 1 # init with 1
1 12 2015-02-01 00:00:05 1
1 25 2015-02-01 07:45:07 2 # far from the previous
1 25 2015-02-01 07:45:08 2
1 25 2015-02-01 21:45:10 3 # far from the previous
0 25 2015-03-04 00:00:08 4 # far from the previous
Here is the original DataFrame:
import pandas as pd
df = pd.DataFrame({'Data_Digital': [1, 1, 1, 1, 1, 0],
'Data_Analog': [10, 12, 25, 25, 25, 25],
'Time': ['2015-02-01 00:00:00', '2015-02-01 00:00:05','2015-02-01 07:45:07',
'2015-02-01 07:45:08', '2015-02-01 21:45:10', '2015-03-04 00:00:08']})
print(df)
This is what I tried (but it is wrong):
index = 0
index2 = 1
df['New_Col_Target'] = 1
for i in range(0, len(df) -1):
    for j in range(1, len(df)):
        if(abs(pd.to_datetime(df['Time'].iloc[i]) -
               pd.to_datetime(df['Time'].iloc[j])) >
           pd.to_timedelta('0 day 06:00:00')):
            # I don't know how to do the assignments
            df['New_Col_Target'].iloc[i] = index
        else:
            # I don't know how to do the assignments
            df['New_Col_Target'].iloc[i] = index2
            index2 += 1
# New process
Date Init Date End Mean_Dig Mean_Analog
2015-02-01 00:00:00 2015-02-01 00:00:05 1 11
2015-02-01 07:45:07 2015-02-01 07:45:08 1 25
2015-02-01 07:45:08 2015-02-01 07:45:08 1 25
2015-03-04 00:00:08 2015-03-04 00:00:08 0 25
df_mean_group_New_Col_Target = pd.DataFrame({'Date Init': ['2015-02-01 00:00:00', '2015-02-01 07:45:07', '2015-02-01 07:45:08', '2015-03-04 00:00:08'],
'Date End': ['2015-02-01 00:00:05', '2015-02-01 07:45:08', '2015-02-01 07:45:08', '2015-03-04 00:00:08'],
'Mean_Data_Digital': [1, 1, 1, 0],
'Mean_Data_Analog': [11, 25, 25, 25]})
print(df_mean_group_New_Col_Target)
Answer 0 (score: 3)
Use diff, pd.Timedelta and cumsum:
df['New_col_target'] = (df['Time'].diff() > pd.Timedelta(hours=6)).cumsum().add(1)
Output
Data_Digital Data_Analog Time New_col_target
0 1 10 2015-02-01 00:00:00 1
1 1 12 2015-02-01 00:00:05 1
2 1 25 2015-02-01 07:45:07 2
3 1 25 2015-02-01 07:45:08 2
4 1 25 2015-02-01 21:45:10 3
5 0 25 2015-03-04 00:00:08 4
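To make the one-liner above easier to follow, here is a minimal sketch that splits it into its three steps on the sample data from the question. The intermediate names gap, is_new_group and group_id are only illustrative and do not appear in the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Data_Digital': [1, 1, 1, 1, 1, 0],
    'Data_Analog': [10, 12, 25, 25, 25, 25],
    'Time': ['2015-02-01 00:00:00', '2015-02-01 00:00:05', '2015-02-01 07:45:07',
             '2015-02-01 07:45:08', '2015-02-01 21:45:10', '2015-03-04 00:00:08'],
})
# diff() needs real datetimes, not strings.
df['Time'] = pd.to_datetime(df['Time'])

# Step 1: time elapsed since the previous row (NaT for the first row).
gap = df['Time'].diff()

# Step 2: True wherever a new group starts; the NaT in row 0 compares as False.
is_new_group = gap > pd.Timedelta(hours=6)

# Step 3: a running count of group starts, shifted so numbering begins at 1.
group_id = is_new_group.cumsum() + 1

df['New_col_target'] = group_id
print(df)
```

Because the first row has no predecessor, its comparison is False and it lands in group 1 only because of the final + 1.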
If your Time column is not yet of datetime dtype, convert it first:
df['Time'] = pd.to_datetime(df['Time'])
To also take Data_Digital into account, we have to use groupby:
m1 = df.groupby('Data_Digital')['Time'].diff().ge(pd.Timedelta(hours=6))
m2 = df['Data_Digital'].diff().ne(0)
df['New_col_target'] = (m1|m2).cumsum()
Output
Data_Digital Data_Analog Time New_col_target
0 1 10 2015-02-01 00:00:00 1
1 1 12 2015-02-01 00:00:05 1
2 1 25 2015-02-01 07:45:07 2
3 1 25 2015-02-01 07:45:08 2
4 1 25 2015-02-01 21:45:10 3
5 0 25 2015-03-04 00:00:08 4
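On the question's sample data both approaches give the same labels, so the effect of the Data_Digital mask is not visible there. The sketch below uses a small made-up frame (demo, which is not from the question) where Data_Digital changes while the rows are only seconds apart.

```python
import pandas as pd

# Hypothetical data: Data_Digital flips from 1 to 0 within a few seconds.
demo = pd.DataFrame({
    'Data_Digital': [1, 1, 0, 0, 0],
    'Time': pd.to_datetime(['2015-02-01 00:00:00', '2015-02-01 00:00:05',
                            '2015-02-01 00:00:10', '2015-02-01 00:00:15',
                            '2015-02-01 08:00:00']),
})

# Time-only grouping: a new group starts only after a gap of more than 6 hours.
time_only = (demo['Time'].diff() > pd.Timedelta(hours=6)).cumsum().add(1)

# m1: gap of at least 6 hours, measured within each Data_Digital value.
m1 = demo.groupby('Data_Digital')['Time'].diff().ge(pd.Timedelta(hours=6))
# m2: Data_Digital differs from the previous row (the first row's NaN counts as a change).
m2 = demo['Data_Digital'].diff().ne(0)
with_digital = (m1 | m2).cumsum()

print(pd.DataFrame({'time_only': time_only, 'with_digital': with_digital}))
```

Here the time-only labels keep rows 0 to 3 together, while the combined mask starts a new group at row 2, where Data_Digital changes. No .add(1) is needed in the second variant because m2 is already True for the first row.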
For the per-group means, we can use groupby.mean:
df.groupby('New_col_target',as_index=False)[['Data_Digital', 'Data_Analog']].mean()
or
df.groupby('New_col_target',as_index=False).agg({'Data_Digital':'mean',
'Data_Analog':'mean'})
Or, if you have pandas >= 0.25.0 (check pd.__version__), we can use named aggregation:
df.groupby('New_col_target').agg(
Digital_mean=('Data_Digital', 'mean'),
Analog_mean=('Data_Analog', 'mean')
).reset_index()
Output
New_col_target Data_Digital Data_Analog
0 1 1 11
1 2 1 25
2 3 1 25
3 4 0 25
Output with named aggregation
New_col_target Digital_mean Analog_mean
0 1 1 11
1 2 1 25
2 3 1 25
3 4 0 25
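The question's second table also lists a start and end time for each group. A possible extension of the named aggregation, assuming pandas >= 0.25.0, is to add the min and max of Time. The output column names Date_Init and Date_End are an assumption: keyword arguments cannot contain spaces, so they differ slightly from the headers in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'Data_Digital': [1, 1, 1, 1, 1, 0],
    'Data_Analog': [10, 12, 25, 25, 25, 25],
    'Time': pd.to_datetime(['2015-02-01 00:00:00', '2015-02-01 00:00:05',
                            '2015-02-01 07:45:07', '2015-02-01 07:45:08',
                            '2015-02-01 21:45:10', '2015-03-04 00:00:08']),
})
df['New_col_target'] = (df['Time'].diff() > pd.Timedelta(hours=6)).cumsum().add(1)

# One row per group: first and last timestamp plus the column means.
summary = (
    df.groupby('New_col_target')
      .agg(Date_Init=('Time', 'min'),
           Date_End=('Time', 'max'),
           Mean_Data_Digital=('Data_Digital', 'mean'),
           Mean_Data_Analog=('Data_Analog', 'mean'))
      .reset_index()
)
print(summary)
```

Groups with a single row, such as groups 3 and 4 in the sample data, get the same value for Date_Init and Date_End.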