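I have two pandas DataFrames. One holds my regular measurements (time-indexed). The second, from a different source, holds system states. It is also time-indexed, but its timestamps do not match those of the measurements. What I want is for each row of the measurements DataFrame to also contain the last state that appeared in the states DataFrame before the time of that measurement.
As an example, I have a states frame like this: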
state
time
2013-02-14 12:29:37.101000 SystemReset
2013-02-14 12:29:39.103000 WaitFace
2013-02-14 12:29:39.103000 NormalExecution
2013-02-14 12:29:39.166000 GreetVisitors
2013-02-14 12:29:46.879000 AskForParticipation
2013-02-14 12:29:56.807000 IntroduceVernissage
2013-02-14 12:30:07.275000 PictureQuestion
And my measurements look like this:
utime
time
2013-02-14 12:29:38.697038 0
2013-02-14 12:29:38.710432 1
2013-02-14 12:29:39.106475 2
2013-02-14 12:29:39.200701 3
2013-02-14 12:29:40.197014 0
2013-02-14 12:29:42.217976 5
2013-02-14 12:29:57.460601 7
I would like to end up with a DataFrame like this:
utime state
time
2013-02-14 12:29:38.697038 0 SystemReset
2013-02-14 12:29:38.710432 1 SystemReset
2013-02-14 12:29:39.106475 2 NormalExecution
2013-02-14 12:29:39.200701 3 GreetVisitors
2013-02-14 12:29:40.197014 0 GreetVisitors
2013-02-14 12:29:42.217976 5 GreetVisitors
2013-02-14 12:29:57.460601 7 IntroduceVernissage
I have found a very inefficient solution:
result = measurements.copy()
stateList = []
for timestamp, _ in measurements.iterrows():
    candidateStates = states.truncate(after=timestamp).tail(1)
    if len(candidateStates) > 0:
        stateList.append(candidateStates['state'].values[0])
    else:
        stateList.append("unknown")
result['state'] = stateList
Do you see any way to optimize this?
Answer 0 (score: 2)
Maybe something like
df = df1.join(df2, how='outer')
df['state'] = df['state'].ffill()
df = df.dropna()
would work? The join produces:
>>> df
state utime
time
2013-02-14 12:29:37.101000 SystemReset NaN
2013-02-14 12:29:38.697038 NaN 0
2013-02-14 12:29:38.710432 NaN 1
2013-02-14 12:29:39.103000 WaitFace NaN
2013-02-14 12:29:39.103000 NormalExecution NaN
2013-02-14 12:29:39.106475 NaN 2
2013-02-14 12:29:39.166000 GreetVisitors NaN
2013-02-14 12:29:39.200701 NaN 3
2013-02-14 12:29:40.197014 NaN 0
2013-02-14 12:29:42.217976 NaN 5
2013-02-14 12:29:46.879000 AskForParticipation NaN
2013-02-14 12:29:56.807000 IntroduceVernissage NaN
2013-02-14 12:29:57.460601 NaN 7
2013-02-14 12:30:07.275000 PictureQuestion NaN
Then we can forward-fill the state column:
>>> df['state'] = df['state'].ffill()
>>> df['state']
time
2013-02-14 12:29:37.101000 SystemReset
2013-02-14 12:29:38.697038 SystemReset
2013-02-14 12:29:38.710432 SystemReset
2013-02-14 12:29:39.103000 WaitFace
2013-02-14 12:29:39.103000 NormalExecution
2013-02-14 12:29:39.106475 NormalExecution
2013-02-14 12:29:39.166000 GreetVisitors
2013-02-14 12:29:39.200701 GreetVisitors
2013-02-14 12:29:40.197014 GreetVisitors
2013-02-14 12:29:42.217976 GreetVisitors
2013-02-14 12:29:46.879000 AskForParticipation
2013-02-14 12:29:56.807000 IntroduceVernissage
2013-02-14 12:29:57.460601 IntroduceVernissage
2013-02-14 12:30:07.275000 PictureQuestion
Name: state
And then drop the rows that have no utime:
>>> df.dropna()
state utime
time
2013-02-14 12:29:38.697038 SystemReset 0
2013-02-14 12:29:38.710432 SystemReset 1
2013-02-14 12:29:39.106475 NormalExecution 2
2013-02-14 12:29:39.200701 GreetVisitors 3
2013-02-14 12:29:40.197014 GreetVisitors 0
2013-02-14 12:29:42.217976 GreetVisitors 5
2013-02-14 12:29:57.460601 IntroduceVernissage 7
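The three steps above can be put together as a self-contained sketch. The frame names states and measurements are my own stand-ins for df1 and df2, rebuilt from the sample data in the question. Note that utime comes back as float, because the outer join introduces NaN into that column.

```python
import pandas as pd

# Sample data from the question; "states" and "measurements" are
# illustrative names standing in for df1 and df2.
states = pd.DataFrame(
    {"state": ["SystemReset", "WaitFace", "NormalExecution", "GreetVisitors",
               "AskForParticipation", "IntroduceVernissage", "PictureQuestion"]},
    index=pd.DatetimeIndex([
        "2013-02-14 12:29:37.101000",
        "2013-02-14 12:29:39.103000",
        "2013-02-14 12:29:39.103000",
        "2013-02-14 12:29:39.166000",
        "2013-02-14 12:29:46.879000",
        "2013-02-14 12:29:56.807000",
        "2013-02-14 12:30:07.275000",
    ], name="time"),
)
measurements = pd.DataFrame(
    {"utime": [0, 1, 2, 3, 0, 5, 7]},
    index=pd.DatetimeIndex([
        "2013-02-14 12:29:38.697038",
        "2013-02-14 12:29:38.710432",
        "2013-02-14 12:29:39.106475",
        "2013-02-14 12:29:39.200701",
        "2013-02-14 12:29:40.197014",
        "2013-02-14 12:29:42.217976",
        "2013-02-14 12:29:57.460601",
    ], name="time"),
)

# 1. Outer join: union of both time indexes, sorted, with NaN where a
#    frame has no row for that timestamp.
df = states.join(measurements, how="outer")

# 2. Forward-fill the state so every row carries the most recent state.
df["state"] = df["state"].ffill()

# 3. Drop rows that still contain NaN, i.e. the pure state rows
#    (and any measurement taken before the first known state).
result = df.dropna()

print(result)
```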
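You may need to tweak this to handle a utime that falls at exactly the same time as (possibly multiple) states; drop_duplicates with take_last=True (keep='last' in current pandas) could probably do that. You will also have to think harder than I can before my morning coffee about the < versus <= question.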
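For the duplicate-timestamp and < versus <= caveats, one alternative worth trying is pandas.merge_asof. This is not the join approach from the answer above, and it needs a pandas version that ships merge_asof. It is only a sketch: the frame names are my own, and the measurements at 12:29:36 and at exactly 12:29:39.103000 are made-up rows added to exercise the edge cases. A backward search picks the last state row at or before each measurement, and allow_exact_matches=False turns that into strictly before.

```python
import pandas as pd

# States from the question (note the duplicated 12:29:39.103000 timestamp).
states = pd.DataFrame(
    {"state": ["SystemReset", "WaitFace", "NormalExecution", "GreetVisitors",
               "AskForParticipation", "IntroduceVernissage", "PictureQuestion"]},
    index=pd.DatetimeIndex([
        "2013-02-14 12:29:37.101000",
        "2013-02-14 12:29:39.103000",
        "2013-02-14 12:29:39.103000",
        "2013-02-14 12:29:39.166000",
        "2013-02-14 12:29:46.879000",
        "2013-02-14 12:29:56.807000",
        "2013-02-14 12:30:07.275000",
    ], name="time"),
)

# A few measurements; the first and third rows are hypothetical additions:
# one before any state is known, one exactly on a state timestamp.
measurements = pd.DataFrame(
    {"utime": [9, 0, 8, 2, 7]},
    index=pd.DatetimeIndex([
        "2013-02-14 12:29:36.000000",  # hypothetical: before the first state
        "2013-02-14 12:29:38.697038",
        "2013-02-14 12:29:39.103000",  # hypothetical: exact match with a state
        "2013-02-14 12:29:39.106475",
        "2013-02-14 12:29:57.460601",
    ], name="time"),
)

# "<=" semantics: last state at or before each measurement.
# Both indexes must be sorted for merge_asof.
inclusive = pd.merge_asof(
    measurements, states,
    left_index=True, right_index=True, direction="backward",
)

# "<" semantics: last state strictly before each measurement.
strict = pd.merge_asof(
    measurements, states,
    left_index=True, right_index=True, direction="backward",
    allow_exact_matches=False,
)

# Measurements that precede every state get NaN; label them like the
# original loop did.
inclusive["state"] = inclusive["state"].fillna("unknown")
strict["state"] = strict["state"].fillna("unknown")

print(inclusive)
print(strict)
```

The two results differ only for the measurement that lands exactly on a state timestamp, which is a quick way to check which of the two comparisons you actually want.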