Question

I have a DataFrame of measurements, containing the measured values and their times.

import datetime
import numpy as np
import pandas

time = [datetime.datetime(2011, 1, 1, np.random.randint(0,23), np.random.randint(1, 59)) for _ in range(10)]
df_meas = pandas.DataFrame({'time': time, 'value': np.random.random(10)})

For example:

                 time     value
0 2011-01-01 21:56:00  0.115025
1 2011-01-01 04:40:00  0.678882
2 2011-01-01 02:18:00  0.507168
3 2011-01-01 22:40:00  0.938408
4 2011-01-01 12:53:00  0.193573
5 2011-01-01 19:37:00  0.464744
6 2011-01-01 16:06:00  0.794495
7 2011-01-01 18:32:00  0.482684
8 2011-01-01 13:26:00  0.381747
9 2011-01-01 01:50:00  0.035798

Data collection is organized in time periods (runs), for which I have another DataFrame:

start = pandas.date_range('1/1/2011', periods=5, freq='H')
stop = start + np.timedelta64(50, 'm')
df_runs = pandas.DataFrame({'start': start, 'stop': stop}, index=np.random.randint(0, 1000000, 5))
df_runs.index.name = 'run'

For example:

                     start                stop
run                                           
721158 2011-01-01 00:00:00 2011-01-01 00:50:00
340902 2011-01-01 01:00:00 2011-01-01 01:50:00
211578 2011-01-01 02:00:00 2011-01-01 02:50:00
120232 2011-01-01 03:00:00 2011-01-01 03:50:00
122199 2011-01-01 04:00:00 2011-01-01 04:50:00

Now I want to merge the two tables to obtain:

                 time     value   run
0 2011-01-01 21:56:00  0.115025   NaN
1 2011-01-01 04:40:00  0.678882   122199  
2 2011-01-01 02:18:00  0.507168   211578 
3 2011-01-01 22:40:00  0.938408   NaN
...

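A time period (a run) has a start and a stop, with stop >= start. Different runs never overlap. Even if it is not the case in my example, you can assume that runs are ordered (by run), so that if run1 < run2 then start1 < start2 (or you can simply sort the table by start). You can also assume that df_meas is sorted by time.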

How can I do this? Is there something built in? What is the most efficient way?
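The assumptions stated above (stop >= start, runs never overlap) can be checked before merging. This is only a minimal sketch: the small df_runs below is made-up sample data, not the random table from the question, and the variable names are illustrative.

```python
import pandas as pd

# Made-up runs for illustration: three hourly runs of 50 minutes each.
df_runs = pd.DataFrame(
    {
        'start': pd.to_datetime(['2011-01-01 00:00', '2011-01-01 01:00', '2011-01-01 02:00']),
        'stop': pd.to_datetime(['2011-01-01 00:50', '2011-01-01 01:50', '2011-01-01 02:50']),
    },
    index=pd.Index([120232, 211578, 721158], name='run'),
)

# Sort by start so that consecutive rows are consecutive runs.
runs = df_runs.sort_values('start')

# Every run must stop at or after its own start.
stop_after_start = bool((runs['stop'] >= runs['start']).all())

# Every run must start strictly after the previous run has stopped.
no_overlap = bool((runs['start'].values[1:] > runs['stop'].values[:-1]).all())

print(stop_after_start, no_overlap)
```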

Answer 1

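You can first reshape df_runs with stack, so that start and stop end up in a single column, time. Then groupby run, resample by minute, and ffill to fill the NaN values. Finally, merge the result with df_meas:

Note: this code works with the latest pandas version, 0.18.1 (see docs).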

import pandas as pd
import numpy as np
import datetime as datetime

#for testing
np.random.seed(1)
time = [datetime.datetime(2011, 1, 1, np.random.randint(0,23), np.random.randint(1, 59)) for _ in range(10)]
df_meas = pd.DataFrame({'time': time, 'value': np.random.random(10)})

start = pd.date_range('1/1/2011', periods=5, freq='H')
stop = start + np.timedelta64(50, 'm')
df_runs = pd.DataFrame({'start': start, 'stop': stop}, index=np.random.randint(0, 1000000, 5))
df_runs.index.name = 'run'

df = (df_runs.stack().reset_index(level=1, drop=True).reset_index(name='time'))
print (df)
      run                time
0   99335 2011-01-01 00:00:00
1   99335 2011-01-01 00:50:00
2  823615 2011-01-01 01:00:00
3  823615 2011-01-01 01:50:00
4  117565 2011-01-01 02:00:00
5  117565 2011-01-01 02:50:00
6  790038 2011-01-01 03:00:00
7  790038 2011-01-01 03:50:00
8  369977 2011-01-01 04:00:00
9  369977 2011-01-01 04:50:00

df1 = (df.set_index('time')
         .groupby('run')
         .resample('Min')
         .ffill()
         .reset_index(level=0, drop=True)
         .reset_index())

print (df1)
                   time     run
0   2011-01-01 00:00:00   99335
1   2011-01-01 00:01:00   99335
2   2011-01-01 00:02:00   99335
3   2011-01-01 00:03:00   99335
4   2011-01-01 00:04:00   99335
5   2011-01-01 00:05:00   99335
6   2011-01-01 00:06:00   99335
7   2011-01-01 00:07:00   99335
8   2011-01-01 00:08:00   99335
9   2011-01-01 00:09:00   99335
...
...

print (pd.merge(df_meas, df1, on='time', how='left'))
                 time     value       run
0 2011-01-01 05:44:00  0.524548       NaN
1 2011-01-01 12:09:00  0.443453       NaN
2 2011-01-01 09:12:00  0.229577       NaN
3 2011-01-01 05:16:00  0.534414       NaN
4 2011-01-01 00:17:00  0.913962   99335.0
5 2011-01-01 01:13:00  0.457205  823615.0
6 2011-01-01 07:46:00  0.430699       NaN
7 2011-01-01 06:26:00  0.939128       NaN
8 2011-01-01 18:21:00  0.778389       NaN
9 2011-01-01 05:19:00  0.715971       NaN
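One thing to keep in mind: the merge above matches timestamps exactly, so it relies on the measurements falling on whole minutes, as they do in the test data. If measurements carry seconds, one option is to floor them to the minute before merging. A minimal sketch, with a tiny made-up lookup table standing in for df1 and a hypothetical helper column named minute:

```python
import pandas as pd

# Tiny stand-in for the per-minute lookup table df1 (made-up data).
df1 = pd.DataFrame({
    'time': pd.to_datetime(['2011-01-01 00:16', '2011-01-01 00:17', '2011-01-01 00:18']),
    'run': [99335, 99335, 99335],
})

# A measurement that does not fall on a whole minute.
df_meas = pd.DataFrame({
    'time': pd.to_datetime(['2011-01-01 00:17:30']),
    'value': [0.5],
})

# Exact merge: 00:17:30 is not in the lookup table, so run is NaN.
exact = pd.merge(df_meas, df1, on='time', how='left')

# Floor to the minute first, then merge on the floored column.
floored = pd.merge(
    df_meas.assign(minute=df_meas['time'].dt.floor('min')),
    df1.rename(columns={'time': 'minute'}),
    on='minute',
    how='left',
)

print(exact)
print(floored[['time', 'value', 'run']])
```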

IanS's solution is very nice, and I tried to improve on it with pd.lreshape:

df_runs['run1'] = -1 
df_runs = df_runs.reset_index()

run_times = (pd.lreshape(df_runs, {'Run':['run', 'run1'], 
                                   'Time':['start', 'stop']})
               .sort_values('Time')
               .set_index('Time'))

print (run_times['Run'].asof(df_meas['time']))

time
2011-01-01 05:44:00        -1
2011-01-01 12:09:00        -1
2011-01-01 09:12:00        -1
2011-01-01 05:16:00        -1
2011-01-01 00:17:00     99335
2011-01-01 01:13:00    823615
2011-01-01 07:46:00        -1
2011-01-01 06:26:00        -1
2011-01-01 18:21:00        -1
2011-01-01 05:19:00        -1
Name: Run, dtype: int64
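As for something built in: pandas 0.19 and later ship pd.merge_asof, which does a backward as-of join on two sorted frames. The sketch below is an alternative technique, not the method used above. It assumes small made-up frames with run as an ordinary column; the stop check afterwards is needed because merge_asof only looks at start.

```python
import pandas as pd

# Made-up sample data for illustration.
df_meas = pd.DataFrame({
    'time': pd.to_datetime(['2011-01-01 00:17', '2011-01-01 00:55',
                            '2011-01-01 01:13', '2011-01-01 05:44']),
    'value': [0.1, 0.2, 0.3, 0.4],
})
df_runs = pd.DataFrame({
    'run': [99335, 823615],
    'start': pd.to_datetime(['2011-01-01 00:00', '2011-01-01 01:00']),
    'stop': pd.to_datetime(['2011-01-01 00:50', '2011-01-01 01:50']),
})

# For each measurement, pick the last run whose start is <= time.
# Both frames must be sorted by their join keys.
merged = pd.merge_asof(
    df_meas.sort_values('time'),
    df_runs.sort_values('start'),
    left_on='time',
    right_on='start',
)

# Drop the run ID where the measurement falls after that run's stop.
merged['run'] = merged['run'].where(merged['time'] <= merged['stop'])
result = merged[['time', 'value', 'run']]
print(result)
```

Unlike the -1 sentinel approach, this treats a measurement taken exactly at stop as belonging to the run.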

Answer 2

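Edit: As suggested in the comments, there is no need to sort the times. Just use stack instead of unstack.

Step 1: Transform the times dataframe

Since the start and stop times are nicely ordered, I set them as the index. I also add a column containing the run ID for start times and NaN for stop times. I do this in many lines (hopefully each one is self-explanatory), but you could certainly condense the code: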

run_times = df_runs.stack().to_frame(name='times')
run_times.reset_index(inplace=True)
run_times['actual_run'] = np.where(run_times['level_1'] == 'start', run_times['run'], np.nan)
run_times.drop(['level_1', 'run'], axis=1, inplace=True)
run_times.set_index('times', drop=True, inplace=True)

Result:

In[101] : run_times
Out[101]: 
                     actual_run
times                          
2011-01-01 00:00:00      110343
2011-01-01 00:50:00         NaN
2011-01-01 01:00:00      839451
2011-01-01 01:50:00         NaN
2011-01-01 02:00:00      742879
2011-01-01 02:50:00         NaN
2011-01-01 03:00:00      275509
2011-01-01 03:50:00         NaN
2011-01-01 04:00:00      788777
2011-01-01 04:50:00         NaN

Step 2: Look up the values

You can now look these up for the original dataframe using the asof method:

In[131] : run_times['actual_run'].fillna(-1).asof(df_meas['time'])
Out[131]: 
2011-01-01 21:56:00        -1
2011-01-01 04:40:00    122199
2011-01-01 02:18:00    211578
2011-01-01 22:40:00        -1
2011-01-01 12:53:00        -1

Note that I had to use -1 instead of NaN, because asof returns the last valid value.
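To see why the sentinel matters, here is a minimal sketch with a single made-up run: with NaN at the stop time, asof skips the NaN and falls back to the run ID at the start, so a measurement taken after the run ended is wrongly assigned to it.

```python
import numpy as np
import pandas as pd

# One made-up run: starts at 00:00, stops at 00:50.
times = pd.to_datetime(['2011-01-01 00:00', '2011-01-01 00:50'])
with_nan = pd.Series([110343, np.nan], index=times)

# A measurement taken after the run has stopped.
probe = pd.Timestamp('2011-01-01 00:55')

# asof skips the NaN at 00:50 and returns the value at 00:00.
skipped = with_nan.asof(probe)

# With -1 as a sentinel, the stop row is a valid value and is returned.
sentinel = with_nan.fillna(-1).asof(probe)

print(skipped, sentinel)
```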

Answer 3

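Edited

If you want to benefit from the tables being sorted, it is sometimes (or usually) better to leave that to pandas (or numpy). For example, you could merge two sorted arrays by hand, as this answer suggests, but pandas does it automatically using low-level functions.

I measured the time used by asof (as in A.asof(I)), and it looks as if it does not benefit from I being sorted. Still, I do not see an easy way to beat it, if that is possible at all.

In my tests, asof was even faster than .loc when the index (A.index) already contained I. The only object I know of that can take advantage of a sorted index is pd.Index. Indeed, A.reindex(idx) with idx = pd.Index(I) is much faster (to use it, A.index must be unique). Unfortunately, the time needed to build the right data frame or series outweighs the benefit.

The answers by @IanS and @jezrael are very fast. In fact, most of the time (almost 40%) in jezrael's second function is spent in lreshape; sort_values and asof take up to 15%.

Of course, it can be optimized further. The results are quite good, so I am leaving it here.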

I used the following setup to generate sorted data frames for testing:

def setup(intervals, periods):
    time = [datetime.datetime(2011, 1, 1, np.random.randint(0,23), np.random.randint(1, 59))
            for _ in range(intervals)]
    df_meas = pd.DataFrame({'time': time, 'value': np.random.random(intervals)})
    df_meas = df_meas.sort_values(by='time')
    df_meas.index = range(df_meas.shape[0])

    start = pd.date_range('1/1/2011', periods=periods, freq='H')
    stop = start + np.timedelta64(50, 'm')
    df_runs = pd.DataFrame({'start': start, 'stop': stop},
                           index=np.unique(np.random.randint(0, 1000000, periods)))
    df_runs.index.name = 'run'
    return df_meas, df_runs

The function below benefits from using asof and from a few tricks that cut unnecessary formatting:

def run(df_meas, df_runs):
    run_times = pd.Series(np.concatenate([df_runs.index, [-1] * df_runs.shape[0]]),
                          index=df_runs.values.flatten(order='F'))
    run_times.sort_index(inplace=True)
    return run_times.asof(df_meas['time'])

I tested it with intervals=100 and periods=20, measuring the results with timeit.

Answer 4

The merge() function can be used to merge two data frames horizontally:

merge(x, y, by ="name")  # merge df x and y using the "name" column

So you may need to rename the "start" column of the first data frame to "time" and give it a try...
