I have a dataframe:
>>> d.head()
Out[11]:
SOURCE
Time
2017-04-03 09:05:07+08:00 g
2017-04-03 09:05:09.744000+08:00 h
2017-04-03 09:05:17.168000+08:00 h
2017-04-03 09:05:27.118000+08:00 f
2017-04-03 09:05:55.616000+08:00 r
>>> d.index
Out[17]:
DatetimeIndex([ '2017-04-03 09:05:07+08:00', '2017-04-03 09:05:09.744000+08:00',...'2017-06-20 04:58:49.685000+08:00'], dtype='datetime64[ns]', name=u'Time', length=783743, freq=None, tz='Asia/Singapore')
I want to add a new column equal to the time difference between consecutive readings. I tried the following, but none of them worked:
1
d['timediff']= d.index.diff()
2
temp = pd.DataFrame(d.index)
d['timediff']= temp.diff().iloc[:,0]
3
temp = pd.DataFrame(d.index)
d['timediff']= pd.Series(temp.diff().iloc[:,0], index=d.index)
4
temp = pd.DataFrame(d.index)
d.assign(td=temp.diff())
All of these result in NaNs in the 'timediff' column.
Finally, this worked:
temp = pd.DataFrame(d.index)
temp = temp.diff().iloc[:,0].values
d = d.assign(timediff = temp)
Can someone clarify what is happening here? FYI, this is what I get from temp.diff():
>>> temp.diff().iloc[0:5,0]
Out[13]:
0 NaN
1 0 days 00:00:02.744000
2 0 days 00:00:07.424000
3 0 days 00:00:09.950000
4 0 days 00:00:28.498000
Name: Time, dtype: object
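I also have another (minor) question: the index of d reads like '2017-04-03 09:05:09.744000+08:00'. This started after I converted the time zone of the index. Any idea what the +08:00 in each index value refers to?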
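Answer 0 (score: 1)
I think you first need to convert the index with to_series, because index.diff() is not implemented yet. The new Series also needs the original index as its labels; otherwise the assignment cannot align the values with the rows of d and you get NaTs: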
print (d.index.to_series())
Time
2017-04-03 09:05:07+08:00 2017-04-03 01:05:07.000
2017-04-03 09:05:09.744000+08:00 2017-04-03 01:05:09.744
2017-04-03 09:05:17.168000+08:00 2017-04-03 01:05:17.168
2017-04-03 09:05:27.118000+08:00 2017-04-03 01:05:27.118
2017-04-03 09:05:55.616000+08:00 2017-04-03 01:05:55.616
Name: Time, dtype: datetime64[ns]
d['diff'] = d.index.to_series().diff()
print (d)
SOURCE diff
Time
2017-04-03 09:05:07+08:00 g NaT
2017-04-03 09:05:09.744000+08:00 h 00:00:02.744000
2017-04-03 09:05:17.168000+08:00 h 00:00:07.424000
2017-04-03 09:05:27.118000+08:00 f 00:00:09.950000
2017-04-03 09:05:55.616000+08:00 r 00:00:28.498000
print (pd.Series(d.index))
0 2017-04-03 09:05:07+08:00
1 2017-04-03 09:05:09.744000+08:00
2 2017-04-03 09:05:17.168000+08:00
3 2017-04-03 09:05:27.118000+08:00
4 2017-04-03 09:05:55.616000+08:00
Name: Time, dtype: datetime64[ns, Asia/Singapore]
d['diff'] = pd.Series(d.index).diff()
print (d)
SOURCE diff
Time
2017-04-03 09:05:07+08:00 g NaT
2017-04-03 09:05:09.744000+08:00 h NaT
2017-04-03 09:05:17.168000+08:00 h NaT
2017-04-03 09:05:27.118000+08:00 f NaT
2017-04-03 09:05:55.616000+08:00 r NaT
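The NaT column above comes from index alignment: assigning a Series to a DataFrame column matches rows by label, not by position. The following is a minimal sketch of that behaviour on a three-row stand-in for d; the sample timestamps and the column names misaligned, by_values and aligned are made up for illustration.

```python
import pandas as pd

# Small stand-in for the frame in the question (values are illustrative).
idx = pd.DatetimeIndex(
    ["2017-04-03 09:05:07.000", "2017-04-03 09:05:09.744", "2017-04-03 09:05:17.168"],
    tz="Asia/Singapore",
    name="Time",
)
d = pd.DataFrame({"SOURCE": ["g", "h", "h"]}, index=idx)

# pd.Series(d.index) is labelled 0, 1, 2, so none of its labels exist in d.index.
by_position = pd.Series(d.index).diff()

# Assignment aligns on labels: no label matches, so every row becomes NaT.
d["misaligned"] = by_position

# A bare array carries no labels, so it is assigned by position instead.
d["by_values"] = by_position.to_numpy()

# to_series() keeps d.index as the labels, so alignment succeeds.
d["aligned"] = d.index.to_series().diff()

print(d)
```

This is also why the .values version in the question worked: stripping the labels forces positional assignment.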
Converting to a DataFrame likewise requires passing the index and then selecting the column as a Series:
d['diff'] = pd.DataFrame(d.index, index=d.index)['Time'].diff()
print (d)
SOURCE diff
Time
2017-04-03 09:05:07+08:00 g NaT
2017-04-03 09:05:09.744000+08:00 h 00:00:02.744000
2017-04-03 09:05:17.168000+08:00 h 00:00:07.424000
2017-04-03 09:05:27.118000+08:00 f 00:00:09.950000
2017-04-03 09:05:55.616000+08:00 r 00:00:28.498000
d['diff'] = pd.DataFrame(d.index, index=d.index).iloc[:, 0].diff()
print (d)
SOURCE diff
Time
2017-04-03 09:05:07+08:00 g NaT
2017-04-03 09:05:09.744000+08:00 h 00:00:02.744000
2017-04-03 09:05:17.168000+08:00 h 00:00:07.424000
2017-04-03 09:05:27.118000+08:00 f 00:00:09.950000
2017-04-03 09:05:55.616000+08:00 r 00:00:28.498000
Recent versions of pandas work well with time zones. If you need to convert the index to UTC, use DatetimeIndex.tz_convert or DataFrame.tz_convert:
d.index = d.index.tz_convert('UTC')
print (d)
SOURCE
Time
2017-04-03 01:05:07+00:00 g
2017-04-03 01:05:09.744000+00:00 h
2017-04-03 01:05:17.168000+00:00 h
2017-04-03 01:05:27.118000+00:00 f
2017-04-03 01:05:55.616000+00:00 r
d = d.tz_convert('UTC')
print (d)
SOURCE
Time
2017-04-03 01:05:07+00:00 g
2017-04-03 01:05:09.744000+00:00 h
2017-04-03 01:05:17.168000+00:00 h
2017-04-03 01:05:27.118000+00:00 f
2017-04-03 01:05:55.616000+00:00 r
To remove the time zone from the DatetimeIndex:
d = d.tz_convert('UTC').tz_localize(None)
print (d)
SOURCE
Time
2017-04-03 01:05:07.000 g
2017-04-03 01:05:09.744 h
2017-04-03 01:05:17.168 h
2017-04-03 01:05:27.118 f
2017-04-03 01:05:55.616 r
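But be careful: calling tz_localize(None) on its own only strips the +08:00 and leaves the local wall-clock times, which differ from the UTC times: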
d = d.tz_localize(None)
print (d)
SOURCE
Time
2017-04-03 09:05:07.000 g
2017-04-03 09:05:09.744 h
2017-04-03 09:05:17.168 h
2017-04-03 09:05:27.118 f
2017-04-03 09:05:55.616 r
See the difference:
d = d.tz_convert('UTC').tz_localize(None).tz_localize('UTC').tz_convert('Asia/Singapore')
print (d)
SOURCE
Time
2017-04-03 09:05:07+08:00 g
2017-04-03 09:05:09.744000+08:00 h
2017-04-03 09:05:17.168000+08:00 h
2017-04-03 09:05:27.118000+08:00 f
2017-04-03 09:05:55.616000+08:00 r
versus
d = d.tz_localize(None).tz_localize('UTC').tz_convert('Asia/Singapore')
print (d)
SOURCE
Time
2017-04-03 17:05:07+08:00 g
2017-04-03 17:05:09.744000+08:00 h
2017-04-03 17:05:17.168000+08:00 h
2017-04-03 17:05:27.118000+08:00 f
2017-04-03 17:05:55.616000+08:00 r
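On the secondary question: the +08:00 suffix is the UTC offset of the index's time zone (Asia/Singapore is eight hours ahead of UTC). The sketch below, using a single made-up timestamp, illustrates that tz_convert keeps the same instant while tz_localize(None) only drops the zone from whatever wall-clock time is currently shown.

```python
import pandas as pd

# One illustrative timestamp in Singapore time.
idx = pd.DatetimeIndex(["2017-04-03 09:05:07"], tz="Asia/Singapore")

# tz_convert keeps the same instant and only changes the displayed wall clock.
as_utc = idx.tz_convert("UTC")

# Dropping the zone after converting leaves naive UTC wall times.
naive_utc = idx.tz_convert("UTC").tz_localize(None)

# Dropping the zone directly leaves naive Singapore wall times.
naive_local = idx.tz_localize(None)

print(idx[0].utcoffset())        # the offset that the "+08:00" suffix displays
print(as_utc[0] == idx[0])       # compares instants, not wall-clock text
print(naive_utc[0], naive_local[0])
```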
Answer 1 (score: 0)
>>> df
SOURCE
Time
2017-04-03 09:05:07+08:00 g
2017-04-03 09:05:09.744000+08:00 h
2017-04-03 09:05:17.168000+08:00 h
2017-04-03 09:05:27.118000+08:00 f
2017-04-03 09:05:55.616000+08:00 r
Since you cannot use .diff() on the index, first convert it to a column:
>>> df['Time'] = df.index
>>> df
SOURCE Time
Time
2017-04-03 09:05:07+08:00 g 2017-04-03 09:05:07+08:00
2017-04-03 09:05:09.744000+08:00 h 2017-04-03 09:05:09.744000+08:00
2017-04-03 09:05:17.168000+08:00 h 2017-04-03 09:05:17.168000+08:00
2017-04-03 09:05:27.118000+08:00 f 2017-04-03 09:05:27.118000+08:00
2017-04-03 09:05:55.616000+08:00 r 2017-04-03 09:05:55.616000+08:00
Then it works well:
>>> df['Time'].diff()
Time
2017-04-03 09:05:07+08:00 NaT
2017-04-03 09:05:09.744000+08:00 00:00:02.744000
2017-04-03 09:05:17.168000+08:00 00:00:07.424000
2017-04-03 09:05:27.118000+08:00 00:00:09.950000
2017-04-03 09:05:55.616000+08:00 00:00:28.498000
Name: Time, dtype: timedelta64[ns]
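If the extra Time column is not wanted, the same differences can probably be computed straight from the index and stored in one step. This is a sketch on made-up sample data; the column names timediff and timediff_s are illustrative, and the conversion to seconds is an optional extra that the answer above does not cover.

```python
import pandas as pd

# Small stand-in for df (values are illustrative).
idx = pd.DatetimeIndex(
    ["2017-04-03 09:05:07.000", "2017-04-03 09:05:09.744", "2017-04-03 09:05:17.168"],
    tz="Asia/Singapore",
    name="Time",
)
df = pd.DataFrame({"SOURCE": ["g", "h", "h"]}, index=idx)

# Differences between consecutive index values, labelled by the index itself,
# so no duplicate "Time" column is needed.
gaps = df.index.to_series().diff()

df["timediff"] = gaps                        # timedelta64 column
df["timediff_s"] = gaps.dt.total_seconds()   # same gaps as float seconds

print(df)
```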