Calculating the time difference between Pandas DataFrame index values

Time: 2013-05-27 17:01:31

Tags: python dataframe pandas

I am trying to add a deltaT column to a DataFrame, where deltaT is the time difference between consecutive rows (as indexed in the time series).

time                 value

2012-03-16 23:50:00      1
2012-03-16 23:56:00      2
2012-03-17 00:08:00      3
2012-03-17 00:10:00      4
2012-03-17 00:12:00      5
2012-03-17 00:20:00      6
2012-03-20 00:43:00      7

The desired result looks like this (deltaT shown in minutes):

time                 value  deltaT

2012-03-16 23:50:00      1       0
2012-03-16 23:56:00      2       6
2012-03-17 00:08:00      3      12
2012-03-17 00:10:00      4       2
2012-03-17 00:12:00      5       2
2012-03-17 00:20:00      6       8
2012-03-20 00:43:00      7      23

3 answers:

Answer 0 (score: 50)

Note that this uses numpy >= 1.7; for numpy < 1.7, see the conversion here: http://pandas.pydata.org/pandas-docs/dev/timeseries.html#time-deltas

The original frame, with a datetime index

In [196]: df
Out[196]: 
                     value
2012-03-16 23:50:00      1
2012-03-16 23:56:00      2
2012-03-17 00:08:00      3
2012-03-17 00:10:00      4
2012-03-17 00:12:00      5
2012-03-17 00:20:00      6
2012-03-20 00:43:00      7

In [199]: df.index
Out[199]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-03-16 23:50:00, ..., 2012-03-20 00:43:00]
Length: 7, Freq: None, Timezone: None
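To reproduce this frame locally, something like the following should work. The construction is an assumption, since the answer only shows the printed result; the timestamps and values are copied from the question.

```python
import pandas as pd

# Timestamps copied from the question's sample data.
times = pd.to_datetime([
    "2012-03-16 23:50:00",
    "2012-03-16 23:56:00",
    "2012-03-17 00:08:00",
    "2012-03-17 00:10:00",
    "2012-03-17 00:12:00",
    "2012-03-17 00:20:00",
    "2012-03-20 00:43:00",
])

# A single 'value' column indexed by the timestamps (a DatetimeIndex).
df = pd.DataFrame({"value": list(range(1, 8))}, index=times)
print(df)
```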

Here is the timedelta64 of what you want

In [200]: df['tvalue'] = df.index

In [201]: df['delta'] = (df['tvalue']-df['tvalue'].shift()).fillna(0)

In [202]: df
Out[202]: 
                     value              tvalue            delta
2012-03-16 23:50:00      1 2012-03-16 23:50:00         00:00:00
2012-03-16 23:56:00      2 2012-03-16 23:56:00         00:06:00
2012-03-17 00:08:00      3 2012-03-17 00:08:00         00:12:00
2012-03-17 00:10:00      4 2012-03-17 00:10:00         00:02:00
2012-03-17 00:12:00      5 2012-03-17 00:12:00         00:02:00
2012-03-17 00:20:00      6 2012-03-17 00:20:00         00:08:00
2012-03-20 00:43:00      7 2012-03-20 00:43:00 3 days, 00:23:00

Getting the answer while ignoring the day difference (your last day is 3/20, the one before it is 3/17) is actually tricky

In [204]: df['ans'] = df['delta'].apply(lambda x: x  / np.timedelta64(1,'m')).astype('int64') % (24*60)

In [205]: df
Out[205]: 
                     value              tvalue            delta  ans
2012-03-16 23:50:00      1 2012-03-16 23:50:00         00:00:00    0
2012-03-16 23:56:00      2 2012-03-16 23:56:00         00:06:00    6
2012-03-17 00:08:00      3 2012-03-17 00:08:00         00:12:00   12
2012-03-17 00:10:00      4 2012-03-17 00:10:00         00:02:00    2
2012-03-17 00:12:00      5 2012-03-17 00:12:00         00:02:00    2
2012-03-17 00:20:00      6 2012-03-17 00:20:00         00:08:00    8
2012-03-20 00:43:00      7 2012-03-20 00:43:00 3 days, 00:23:00   23
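The same computation can probably be done without the helper column or the apply call. This is a sketch under assumptions, not the answer's own code: it uses Index.to_series().diff() in place of the shift subtraction and fills the first row with pd.Timedelta(0) rather than the integer 0.

```python
import numpy as np
import pandas as pd

times = pd.to_datetime([
    "2012-03-16 23:50:00", "2012-03-16 23:56:00", "2012-03-17 00:08:00",
    "2012-03-17 00:10:00", "2012-03-17 00:12:00", "2012-03-17 00:20:00",
    "2012-03-20 00:43:00",
])
df = pd.DataFrame({"value": list(range(1, 8))}, index=times)

# Difference between consecutive index values; the first row has no
# predecessor, so its NaT is replaced with a zero-length Timedelta.
delta = df.index.to_series().diff().fillna(pd.Timedelta(0))

# Dividing by a one-minute timedelta64 gives total minutes as floats.
total_minutes = delta / np.timedelta64(1, "m")

# Taking the result modulo the minutes in a day drops whole days,
# mirroring the % (24*60) step in In [204].
df["ans"] = total_minutes.astype("int64") % (24 * 60)
print(df)
```

The last row works out to 4343 total minutes, and 4343 % 1440 leaves the 23 shown in the output above.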

Answer 1 (score: 23)

We can use to_series to create a Series whose index and values both equal the index keys, then compute the difference between consecutive rows, which yields a timedelta64[ns] dtype. With that in hand, the .dt accessor gives access to the seconds attribute of the time portion; dividing each element by 60 then gives the output in minutes (optionally filling the first value with 0).

In [13]: df['deltaT'] = df.index.to_series().diff().dt.seconds.div(60, fill_value=0)
    ...: df                                 # use .astype(int) to obtain integer values
Out[13]: 
                     value  deltaT
time                              
2012-03-16 23:50:00      1     0.0
2012-03-16 23:56:00      2     6.0
2012-03-17 00:08:00      3    12.0
2012-03-17 00:10:00      4     2.0
2012-03-17 00:12:00      5     2.0
2012-03-17 00:20:00      6     8.0
2012-03-20 00:43:00      7    23.0

Breakdown:

When we perform diff:

In [8]: ser_diff = df.index.to_series().diff()

In [9]: ser_diff
Out[9]: 
time
2012-03-16 23:50:00               NaT
2012-03-16 23:56:00   0 days 00:06:00
2012-03-17 00:08:00   0 days 00:12:00
2012-03-17 00:10:00   0 days 00:02:00
2012-03-17 00:12:00   0 days 00:02:00
2012-03-17 00:20:00   0 days 00:08:00
2012-03-20 00:43:00   3 days 00:23:00
Name: time, dtype: timedelta64[ns]

Converting seconds to minutes:

In [10]: ser_diff.dt.seconds.div(60, fill_value=0)
Out[10]: 
time
2012-03-16 23:50:00     0.0
2012-03-16 23:56:00     6.0
2012-03-17 00:08:00    12.0
2012-03-17 00:10:00     2.0
2012-03-17 00:12:00     2.0
2012-03-17 00:20:00     8.0
2012-03-20 00:43:00    23.0
Name: time, dtype: float64

If you also want to include the date portion that was excluded above (only the time portion was considered), dt.total_seconds gives the elapsed duration in seconds, from which the minutes can again be computed by division.
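To make the contrast between the time portion and the full elapsed duration concrete, here is a minimal sketch on the question's data. It is an illustration under assumptions, not the answer's own code: the names time_part and elapsed are made up here, and fillna(0) is applied explicitly instead of relying on div(..., fill_value=0).

```python
import pandas as pd

times = pd.DatetimeIndex([
    "2012-03-16 23:50:00", "2012-03-16 23:56:00", "2012-03-17 00:08:00",
    "2012-03-17 00:10:00", "2012-03-17 00:12:00", "2012-03-17 00:20:00",
    "2012-03-20 00:43:00",
], name="time")
df = pd.DataFrame({"value": list(range(1, 8))}, index=times)

ser_diff = df.index.to_series().diff()

# .dt.seconds only sees the within-day part of each timedelta,
# so "3 days 00:23:00" contributes just 23 minutes.
time_part = ser_diff.dt.seconds.fillna(0) / 60

# .dt.total_seconds() covers the whole duration, days included.
elapsed = ser_diff.dt.total_seconds().fillna(0) / 60

print(pd.DataFrame({"time_part": time_part, "elapsed": elapsed}))
```

The two columns agree on every row except the last, where the three-day gap shows up only in elapsed (4343 minutes versus 23).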

Answer 2 (score: 2)

>= Numpy version 1.7.0.

You can also typecast df.index.to_series().diff() from timedelta64[ns] (nanoseconds, the default dtype) to timedelta64[m] (minutes). [Frequency conversion: astyping is equivalent to floor division.]

df['ΔT'] = df.index.to_series().diff().astype('timedelta64[m]')

                     value      ΔT
time                              
2012-03-16 23:50:00      1     NaN
2012-03-16 23:56:00      2     6.0
2012-03-17 00:08:00      3    12.0
2012-03-17 00:10:00      4     2.0
2012-03-17 00:12:00      5     2.0
2012-03-17 00:20:00      6     8.0
2012-03-20 00:43:00      7  4343.0

(ΔT dtype: float64)

If you want to convert to int, fill the NA values with 0 before converting:

>>> df.index.to_series().diff().fillna(0).astype('timedelta64[m]').astype('int')

time
2012-03-16 23:50:00       0
2012-03-16 23:56:00       6
2012-03-17 00:08:00      12
2012-03-17 00:10:00       2
2012-03-17 00:12:00       2
2012-03-17 00:20:00       8
2012-03-20 00:43:00    4343
Name: time, dtype: int64

Timedelta data types support a large number of time units, as well as generic units that can be coerced into any of the other units.

The date units are:

Y   year
M   month
W   week
D   day

The time units are:

h   hour
m   minute
s   second
ms  millisecond
us  microsecond
ns  nanosecond
ps  picosecond
fs  femtosecond
as  attosecond

If you want the difference with decimals, use true division, i.e. divide by np.timedelta64(1, 'm').
For example, if df is as follows,

                     value
time                      
2012-03-16 23:50:21      1
2012-03-16 23:56:28      2
2012-03-17 00:08:08      3
2012-03-17 00:10:56      4
2012-03-17 00:12:12      5
2012-03-17 00:20:00      6
2012-03-20 00:43:43      7

note the difference between astyping (floor division) and true division: astyping truncates to whole minutes, while true division keeps the fractional part.
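A minimal sketch of that comparison on the frame with seconds. It is an assumption-laden illustration rather than the answer's own code: the floor step is written as an explicit // by pd.Timedelta(minutes=1) instead of astype('timedelta64[m]'), because, as far as I know, recent pandas versions no longer accept a cast to minute resolution, and the first row is filled with pd.Timedelta(0) rather than 0.

```python
import pandas as pd

times = pd.DatetimeIndex([
    "2012-03-16 23:50:21", "2012-03-16 23:56:28", "2012-03-17 00:08:08",
    "2012-03-17 00:10:56", "2012-03-17 00:12:12", "2012-03-17 00:20:00",
    "2012-03-20 00:43:43",
], name="time")
df = pd.DataFrame({"value": list(range(1, 8))}, index=times)

# Consecutive differences, with the first row's NaT replaced by zero.
diff = df.index.to_series().diff().fillna(pd.Timedelta(0))
one_minute = pd.Timedelta(minutes=1)

# Floor division: whole minutes only, the fractional part is discarded.
floor_minutes = diff // one_minute

# True division: minutes as floats, fractional part kept.
true_minutes = diff / one_minute

print(pd.DataFrame({"floor": floor_minutes, "true": true_minutes}))
```

For instance, the second row is 6 minutes 7 seconds apart from the first, which floor division reports as 6 and true division as roughly 6.12.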