I have two DataFrames. DF1 contains the following:
User | Time interval
User01 | [01/01/2014 08:12:00, 01/01/2014 08:13:43]
User02 | [01/03/2014 07:21:44, 01/04/2014 01:07:01]
DF2 contains events:
User | Time | Value
User01 | 01/03/2014 04:11:00 | 9
User01 | 01/01/2014 08:10:00 | 12
User02 | 01/03/2014 09:11:00 | 3
User02 | 01/02/2014 11:10:00 | 21
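I want to add three columns to DF1 containing the mean, standard deviation and maximum of each user's values within that user's time interval, computed from the events in DF2.
So the final result should look like this:
User | Time interval | Mean | Max | StDev
User01 | [01/01/2014 08:12:00, 01/01/2014 08:13:43] | NaN | NaN | NaN
User02 | [01/03/2014 07:21:44, 01/04/2014 01:07:01] | 3 | 3 | 0
My tables are large. Is there an efficient way to do this? Is there some kind of "groupby" function that groups by time intervals taken from another DataFrame?
Code: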
import numpy as np
import pandas as pd

# Intervals: one row per user
DF1 = pd.DataFrame({'User': ["User01", "User02"],
                    'Time start': ["01/01/2014 08:12:00", "01/03/2014 07:21:44"],
                    'Time end': ["01/01/2014 08:13:43", "01/04/2014 01:07:01"]},
                   index=['1', '2'])
# Events: several rows per user
DF2 = pd.DataFrame({'User': ["User01", "User01", "User02", "User02"],
                    'Time': ["01/03/2014 04:11:00", "01/01/2014 08:10:00",
                             "01/03/2014 09:11:00", "01/02/2014 11:10:00"],
                    'Value': [9, 12, 3, 21]},
                   index=['1', '2', '3', '4'])
# Expected result (np.nan instead of the string "Nan")
DF3 = pd.DataFrame({'User': ["User01", "User02"],
                    'Time start': ["01/01/2014 08:12:00", "01/03/2014 07:21:44"],
                    'Time end': ["01/01/2014 08:13:43", "01/04/2014 01:07:01"],
                    'Mean': [np.nan, 3], 'Max': [np.nan, 3], 'StDev': [np.nan, 0]},
                   index=['1', '2'])
Answer 0 (score: -1)
First, merge DF1 and DF2:
df = DF2.merge(DF1,on="User")
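Next, check whether Time falls between Time start and Time end, and store the result in an indicator column ("keep"). The comparison assumes the three time columns have already been converted with pd.to_datetime; the question's code builds them as strings.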
import numpy as np
df.loc[:,"keep"] = (np.logical_and(df.loc[:,"Time start"]<=df.loc[:,"Time"],df.loc[:,"Time"]<=df.loc[:,"Time end"]))*1
Output:
Time User Value Time end Time start keep
2014-01-03 04:11:00 User01 9 2014-01-01 08:13:43 2014-01-01 08:12:00 0
2014-01-01 08:10:00 User01 12 2014-01-01 08:13:43 2014-01-01 08:12:00 0
2014-01-03 09:11:00 User02 3 2014-01-04 01:07:01 2014-01-03 07:21:44 1
2014-01-02 11:10:00 User02 21 2014-01-04 01:07:01 2014-01-03 07:21:44 0
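The output above shows datetime values, while the question's code creates the time columns as strings, so a conversion step is implied. The following is a minimal sketch of that step together with the merge and the indicator, assuming the timestamps are month/day/year and that the "011:10:00" entry in the question is a typo for "11:10:00". It uses Series.between in place of np.logical_and, which is a substitution and not the answer's own code.

```python
import pandas as pd

# Same sample data as in the question (with the "011:10:00" typo corrected).
DF1 = pd.DataFrame({
    "User": ["User01", "User02"],
    "Time start": ["01/01/2014 08:12:00", "01/03/2014 07:21:44"],
    "Time end": ["01/01/2014 08:13:43", "01/04/2014 01:07:01"],
})
DF2 = pd.DataFrame({
    "User": ["User01", "User01", "User02", "User02"],
    "Time": ["01/03/2014 04:11:00", "01/01/2014 08:10:00",
             "01/03/2014 09:11:00", "01/02/2014 11:10:00"],
    "Value": [9, 12, 3, 21],
})

# Assumption: timestamps are month/day/year, as the sample output suggests.
fmt = "%m/%d/%Y %H:%M:%S"
for col in ["Time start", "Time end"]:
    DF1[col] = pd.to_datetime(DF1[col], format=fmt)
DF2["Time"] = pd.to_datetime(DF2["Time"], format=fmt)

# Pair every event with its user's interval.
df = DF2.merge(DF1, on="User")

# Series.between is inclusive on both ends by default, matching the <= tests.
df["keep"] = df["Time"].between(df["Time start"], df["Time end"]).astype(int)
```

With real datetimes the comparison is chronological instead of character by character, which matters as soon as the data spans more than one month or year.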
Now keep only the rows where keep == 1:
df = df.loc[df.keep==1,:]
Now aggregate df with groupby:
df4 = df.groupby("User")["Value"].agg(['max','mean','std']).reset_index()
Output:
User max mean std
User02 3 3 NaN
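The std column is NaN here because pandas computes the sample standard deviation (ddof=1) by default, which is undefined for a single observation, whereas the question's expected result shows 0. If 0 is what is wanted, one option is the population standard deviation (ddof=0). A minimal sketch, assuming a hypothetical filtered frame named kept with a made-up User03 added only to show a group with more than one event:

```python
import pandas as pd

# Hypothetical filtered events: one for User02, two for a made-up User03.
kept = pd.DataFrame({"User": ["User02", "User03", "User03"],
                     "Value": [3, 2, 4]})

grouped = kept.groupby("User")["Value"]
stats = grouped.agg(["max", "mean"])

# Population standard deviation (ddof=0) is defined for a single observation.
stats["std"] = grouped.std(ddof=0)
stats = stats.reset_index()
```

Whether ddof=0 or ddof=1 is appropriate depends on whether the events are treated as the full population or as a sample, so this is a choice to make deliberately.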
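Finally, merge df4 back into DF1. A left merge keeps users with no events inside their interval; df4 has already had its index reset, so it is passed as is:
DF1.merge(df4, on="User", how="left")
Output: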
Time end Time start User max mean std
2014-01-01 08:13:43 2014-01-01 08:12:00 User01 NaN NaN NaN
2014-01-04 01:07:01 2014-01-03 07:21:44 User02 3.0 3.0 NaN