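Why is the datetime difference between two rows in a DataFrame zero?

Time: 2019-07-08 02:19:39

Tags: python pandas numpy dataframe datetime

The problem I am facing is very simple, but it is strange and has been bothering me for a while.

I have a DataFrame that looks like this: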

df['datetime'] = df['datetime'].dt.tz_convert('US/Pacific') 
#converting datetime from datetime64[ns, UTC] to datetime64[ns,US/Pacific]

df.head()

                vehicle_id  trip_id                                 datetime    
        6760612 1000500 4f874888ce404720a203e36f1cf5b716    2017-01-01 10:00:00-08:00       
        6760613 1000500 4f874888ce404720a203e36f1cf5b716    2017-01-01 10:00:01-08:00    
        6760614 1000500 4f874888ce404720a203e36f1cf5b716    2017-01-01 10:00:02-08:00      
        6760615 1000500 4f874888ce404720a203e36f1cf5b716    2017-01-01 10:00:03-08:00       
        6760616 1000500 4f874888ce404720a203e36f1cf5b716    2017-01-01 10:00:04-08:00

df.info()

vehicle_id         int64
trip_id            object
datetime           datetime64[ns, US/Pacific]

I am trying to compute the difference between consecutive datetime values, in two different ways:

df['datetime_diff'] = df['datetime'].diff()

df['time_diff'] = (df['datetime'] - df['datetime'].shift(1)).astype('timedelta64[s]')

For one particular trip_id, I get the following result:

df[trip_frame['trip_id'] == '4f874888ce404720a203e36f1cf5b716'][['datetime','datetime_diff','time_diff']].head()

        datetime                  datetime_diff time_diff
6760612 2017-01-01 10:00:00-08:00   NaT             NaN
6760613 2017-01-01 10:00:01-08:00   00:00:01        1.0
6760614 2017-01-01 10:00:02-08:00   00:00:01        1.0
6760615 2017-01-01 10:00:03-08:00   00:00:01        1.0
6760616 2017-01-01 10:00:04-08:00   00:00:01        1.0

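But for other trip_ids, such as the ones below, the datetime difference comes out as zero in both columns, even though the displayed timestamps are one second apart. The time difference is in seconds.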

df[trip_frame['trip_id'] == '01b8a24510cd4e4684d67b96369286e0'][['datetime','datetime_diff','time_diff']].head(4)

         datetime            datetime_diff  time_diff
3236107 2017-01-28 03:00:00-08:00   0 days  0.0
3236108 2017-01-28 03:00:01-08:00   0 days  0.0
3236109 2017-01-28 03:00:02-08:00   0 days  0.0
3236110 2017-01-28 03:00:03-08:00   0 days  0.0

df[df['trip_id'] == '01c2a70c25e5428bb33811ca5eb19270'][['datetime','datetime_diff','time_diff']].head(4)

        datetime             datetime_diff  time_diff
8915474 2017-01-21 10:00:00-08:00   0 days  0.0
8915475 2017-01-21 10:00:01-08:00   0 days  0.0
8915476 2017-01-21 10:00:02-08:00   0 days  0.0
8915477 2017-01-21 10:00:03-08:00   0 days  0.0

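Does anyone know what the real problem is? I would really appreciate any help.

1 Answer:

Answer 0: (score: 0)

If I run your code without the type conversion, everything works fine: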

df.timestamp - df.timestamp.shift(1)

On these example rows:

import pandas as pd

rows = ['2017-01-21 10:00:00-08:00',
        '2017-01-21 10:00:01-08:00',
        '2017-01-21 10:00:02-08:00',
        '2017-01-21 10:00:03-08:00',
        '2017-01-21 10:00:03-08:00']  # the above lines are from your example. I just invented this last line to have one equal entry
df = pd.DataFrame(rows, columns=['timestamp'])
# pd.to_datetime is used here because a unit-less astype('datetime64') is not accepted by newer pandas versions
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.timestamp - df.timestamp.shift(1)

The last line returns:

Out[40]: 
0        NaT
1   00:00:01
2   00:00:01
3   00:00:01
4   00:00:00
Name: timestamp, dtype: timedelta64[ns]

So far nothing looks suspicious. Note that at this point you already have a timedelta64 Series.
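To illustrate that point, here is a minimal sketch on made-up data shaped like the rows in the question (the sample timestamps and the frame itself are assumptions, not the asker's data). It checks that `diff()` and the manual subtraction with `shift(1)` give the same timedelta result on a timezone-aware column, so no further conversion is needed to get a difference.

```python
import pandas as pd

# Hypothetical sample data modelled on the question: UTC timestamps one second apart.
ts = pd.to_datetime(
    [
        "2017-01-21 18:00:00",
        "2017-01-21 18:00:01",
        "2017-01-21 18:00:02",
        "2017-01-21 18:00:03",
    ],
    utc=True,
)
df = pd.DataFrame({"datetime": ts})

# Same conversion as in the question: UTC -> US/Pacific.
df["datetime"] = df["datetime"].dt.tz_convert("US/Pacific")

# The two approaches from the question, without any astype() afterwards.
by_diff = df["datetime"].diff()
by_shift = df["datetime"] - df["datetime"].shift(1)

print(by_diff.dtype)             # a timedelta64 dtype
print(by_diff.equals(by_shift))  # both approaches agree
```

If the two results agree on your real data as well, the zeros are unlikely to come from the subtraction itself.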

If I now add your conversion, I get:

(df.timestamp - df.timestamp.shift(1)).astype('timedelta64[s]')
Out[42]: 
0    NaN
1    1.0
2    1.0
3    1.0
4    0.0
Name: timestamp, dtype: float64

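As you can see, the result is a Series of floats. That is probably because the Series contains a NaN. The other thing is the [s] suffix, which does not seem to work; if you use [ns] instead, it seems to work. If you want to get rid of the nanosecond part somehow, I guess you need to do that in a separate step.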
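As far as I know, what `astype('timedelta64[s]')` returns depends on the pandas version: older releases gave float64 seconds, as in the output above, while pandas 2.0 and later keep a timedelta dtype. A more explicit option is sketched below, using `.dt.total_seconds()` for float seconds and `.dt.floor('s')` to drop the sub-second part. The sample timestamps are invented for illustration.

```python
import pandas as pd

# Hypothetical timestamps, including a half-second step and a repeated value.
ts = pd.Series(
    pd.to_datetime(
        [
            "2017-01-21 10:00:00.000",
            "2017-01-21 10:00:01.000",
            "2017-01-21 10:00:02.500",
            "2017-01-21 10:00:02.500",
        ]
    )
)

delta = ts.diff()  # timedelta Series; the first entry is NaT

# Difference as float seconds; NaT becomes NaN.
seconds = delta.dt.total_seconds()

# Difference kept as a timedelta, truncated to whole seconds.
whole_seconds = delta.dt.floor("s")

print(seconds)
print(whole_seconds)
```

Being explicit about the unit this way avoids relying on how a particular pandas version interprets the `astype` call.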