The problem I'm facing is very simple, yet strange, and it has been bothering me for a while.
I have a data frame that looks like this:
df['datetime'] = df['datetime'].dt.tz_convert('US/Pacific')
#converting datetime from datetime64[ns, UTC] to datetime64[ns,US/Pacific]
df.head()
vehicle_id trip_id datetime
6760612 1000500 4f874888ce404720a203e36f1cf5b716 2017-01-01 10:00:00-08:00
6760613 1000500 4f874888ce404720a203e36f1cf5b716 2017-01-01 10:00:01-08:00
6760614 1000500 4f874888ce404720a203e36f1cf5b716 2017-01-01 10:00:02-08:00
6760615 1000500 4f874888ce404720a203e36f1cf5b716 2017-01-01 10:00:03-08:00
6760616 1000500 4f874888ce404720a203e36f1cf5b716 2017-01-01 10:00:04-08:00
df.info()
vehicle_id int64
trip_id object
datetime datetime64[ns, US/Pacific]
I am trying to compute the difference between consecutive datetime values as follows (in two different ways):
df['datetime_diff'] = df['datetime'].diff()
df['time_diff'] = (df['datetime'] - df['datetime'].shift(1)).astype('timedelta64[s]')
For one particular trip_id, I get the following result:
df[trip_frame['trip_id'] == '4f874888ce404720a203e36f1cf5b716'][['datetime','datetime_diff','time_diff']].head()
datetime datetime_diff time_diff
6760612 2017-01-01 10:00:00-08:00 NaT NaN
6760613 2017-01-01 10:00:01-08:00 00:00:01 1.0
6760614 2017-01-01 10:00:02-08:00 00:00:01 1.0
6760615 2017-01-01 10:00:03-08:00 00:00:01 1.0
6760616 2017-01-01 10:00:04-08:00 00:00:01 1.0
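But for other trip_ids, like the ones below, you can see that the datetime difference comes out as zero in both columns, even though the timestamps shown are one second apart. The time difference is in seconds.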
df[trip_frame['trip_id'] == '01b8a24510cd4e4684d67b96369286e0'][['datetime','datetime_diff','time_diff']].head(4)
datetime datetime_diff time_diff
3236107 2017-01-28 03:00:00-08:00 0 days 0.0
3236108 2017-01-28 03:00:01-08:00 0 days 0.0
3236109 2017-01-28 03:00:02-08:00 0 days 0.0
3236110 2017-01-28 03:00:03-08:00 0 days 0.0
df[df['trip_id'] == '01c2a70c25e5428bb33811ca5eb19270'][['datetime','datetime_diff','time_diff']].head(4)
datetime datetime_diff time_diff
8915474 2017-01-21 10:00:00-08:00 0 days 0.0
8915475 2017-01-21 10:00:01-08:00 0 days 0.0
8915476 2017-01-21 10:00:02-08:00 0 days 0.0
8915477 2017-01-21 10:00:03-08:00 0 days 0.0
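Does anyone know what the real problem is? I would really appreciate any help.
Answer 0 (score: 0)
If I run your code without the type conversion, everything looks fine: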
df.timestamp - df.timestamp.shift(1)
on the sample rows
import pandas as pd

rows = ['2017-01-21 10:00:00-08:00',
        '2017-01-21 10:00:01-08:00',
        '2017-01-21 10:00:02-08:00',
        '2017-01-21 10:00:03-08:00',
        '2017-01-21 10:00:03-08:00']  # the above lines are from your example. I just invented this last line to have one equal entry
df = pd.DataFrame(rows, columns=['timestamp'])
# parse the strings into datetimes (a unit-less astype('datetime64') is rejected by recent pandas versions)
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.timestamp - df.timestamp.shift(1)
The last line returns
Out[40]:
0 NaT
1 00:00:01
2 00:00:01
3 00:00:01
4 00:00:00
Name: timestamp, dtype: timedelta64[ns]
So far this doesn't look suspicious. Note that you already have a timedelta64 series.
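For what it's worth, the built-in diff() should give the same timedelta series as the explicit subtraction. Here is a minimal sketch on a few made-up, tz-aware timestamps (the values and the names ts, by_diff and by_shift are invented for illustration, they are not taken from your data):

```python
import pandas as pd

# Made-up timestamps for illustration only, converted to US/Pacific
# the same way the question converts its datetime column.
ts = pd.Series(pd.to_datetime([
    '2017-01-21 18:00:00',
    '2017-01-21 18:00:01',
    '2017-01-21 18:00:03',
])).dt.tz_localize('UTC').dt.tz_convert('US/Pacific')

by_diff = ts.diff()           # built-in difference between consecutive rows
by_shift = ts - ts.shift(1)   # explicit subtraction of the shifted series

print(by_diff.dtype)          # a timedelta64 dtype
print(by_diff.equals(by_shift))
```

If the two results agree on your real data as well, the choice between diff() and the shift-based subtraction is unlikely to be the source of the zeros.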
If I now add your conversion, I get:
(df.timestamp - df.timestamp.shift(1)).astype('timedelta64[s]')
Out[42]:
0 NaN
1 1.0
2 1.0
3 1.0
4 0.0
Name: timestamp, dtype: float64
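As you can see, the result is a series of floats. This is probably because the series contains a NaN. The other thing is the [s] suffix, which doesn't seem to work. If you use [ns], it seems to work. If you want to get rid of the nanoseconds somehow, I guess you'll need to do that separately.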
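If the goal is just the difference in seconds as a plain number, one option that avoids the unit suffix in astype altogether is Series.dt.total_seconds(). The following is only a minimal sketch on the sample timestamps from above; the names delta and seconds are mine. As far as I know, whether astype('timedelta64[s]') returns floats or a timedelta series depends on the pandas version, so total_seconds() may be the more predictable choice:

```python
import pandas as pd

rows = ['2017-01-21 10:00:00-08:00',
        '2017-01-21 10:00:01-08:00',
        '2017-01-21 10:00:02-08:00',
        '2017-01-21 10:00:03-08:00',
        '2017-01-21 10:00:03-08:00']  # last row duplicated on purpose
df = pd.DataFrame(rows, columns=['timestamp'])
df['timestamp'] = pd.to_datetime(df['timestamp'])

delta = df['timestamp'].diff()        # timedelta series, first entry is NaT
seconds = delta.dt.total_seconds()    # float seconds, first entry is NaN

print(seconds)
```

The first entry stays missing because there is no previous row to subtract; if you need integers, you would have to fill or drop that entry first.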