I have a pandas DataFrame in which one column indicates whether the location value in another column changes in the row below it. For example:
2013-02-05 19:45:00 (39.94, -86.159) True
2013-02-05 19:50:00 (39.94, -86.159) True
2013-02-05 19:55:00 (39.94, -86.159) False
2013-02-05 20:00:00 (39.777, -85.995) False
2013-02-05 20:05:00 (39.775, -85.978) True
2013-02-05 20:10:00 (39.775, -85.978) True
2013-02-05 20:15:00 (39.775, -85.978) False
2013-02-05 20:20:00 (39.94, -86.159) True
2013-02-05 20:25:00 (39.94, -86.159) False
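So what I want to do is go through this DataFrame row by row and look for the False rows. Then (possibly by adding another column) I want to record the total "continuous" time spent at that place. The same place can be visited again, as in the example above; in that case it is treated as a separate occurrence. So for the example above, something like: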
2013-02-05 19:45:00 (39.94, -86.159) True 0
2013-02-05 19:50:00 (39.94, -86.159) True 0
2013-02-05 19:55:00 (39.94, -86.159) False 15
2013-02-05 20:00:00 (39.777, -85.995) False 5
2013-02-05 20:05:00 (39.775, -85.978) True 0
2013-02-05 20:10:00 (39.775, -85.978) True 0
2013-02-05 20:15:00 (39.775, -85.978) False 15
2013-02-05 20:20:00 (39.94, -86.159) True 0
2013-02-05 20:25:00 (39.94, -86.159) False 10
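I would then plot a histogram of these "continuous" times spent each day using the hist() function. How can I get the second DataFrame from the first by iterating over it? I'm new to Python and pandas, and the real data file is large, so I need something reasonably efficient.
Answer 0 (score: 7)
Here's another take: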
df['group'] = (df.condition == False).astype('int').cumsum().shift(1).fillna(0)
df
date long lat condition group
2/5/2013 19:45:00 39.940 -86.159 True 0
2/5/2013 19:50:00 39.940 -86.159 True 0
2/5/2013 19:55:00 39.940 -86.159 False 0
2/5/2013 20:00:00 39.777 -85.995 False 1
2/5/2013 20:05:00 39.775 -85.978 True 2
2/5/2013 20:10:00 39.775 -85.978 True 2
2/5/2013 20:15:00 39.775 -85.978 False 2
2/5/2013 20:20:00 39.940 -86.159 True 3
2/5/2013 20:25:00 39.940 -86.159 False 3
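To make the grouping step easier to follow, here is a minimal self-contained sketch of the same idea. The DataFrame is rebuilt from the question's sample, with timestamps assumed to fall on a regular 5-minute grid, and `shift(1, fill_value=0)` is a swapped-in equivalent of `shift(1).fillna(0)` that keeps the labels as integers.

```python
import pandas as pd

# Sample data from the question; True means the next row is at the same place.
df = pd.DataFrame({
    "date": pd.date_range("2013-02-05 19:45", periods=9, freq="5min"),
    "condition": [True, True, False, False, True, True, False, True, False],
})

# Each False row closes a visit, so count the False rows seen so far...
closed = (~df["condition"]).cumsum()
# ...then shift down one row so the closing row stays in the visit it ends.
df["group"] = closed.shift(1, fill_value=0)

print(df["group"].tolist())
```

Without the shift, each False row would be labelled as the start of the next visit rather than the end of the current one.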
df['result'] = df.groupby(['group']).date.transform(lambda sdf: 5 * len(sdf))  # 5 minutes per row
df
date long lat condition group result
2/5/2013 19:45:00 39.940 -86.159 True 0 15
2/5/2013 19:50:00 39.940 -86.159 True 0 15
2/5/2013 19:55:00 39.940 -86.159 False 0 15
2/5/2013 20:00:00 39.777 -85.995 False 1 5
2/5/2013 20:05:00 39.775 -85.978 True 2 15
2/5/2013 20:10:00 39.775 -85.978 True 2 15
2/5/2013 20:15:00 39.775 -85.978 False 2 15
2/5/2013 20:20:00 39.940 -86.159 True 3 10
2/5/2013 20:25:00 39.940 -86.159 False 3 10
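The result above repeats the total on every row of a visit, whereas the question asked for 0 on the True rows. One possible way to get that layout is sketched below. It assumes the same fixed 5-minute sampling, and it uses `transform("size")` in place of the lambda as a stylistic choice rather than the answer's own code.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2013-02-05 19:45", periods=9, freq="5min"),
    "condition": [True, True, False, False, True, True, False, True, False],
})
df["group"] = (~df["condition"]).cumsum().shift(1, fill_value=0)

# Rows per visit times the sampling interval (assumed to be a fixed 5 minutes).
minutes = df.groupby("group")["condition"].transform("size") * 5
# Keep the total only on the False row that ends each visit; 0 elsewhere.
df["result"] = minutes.where(~df["condition"], 0)

print(df[["date", "condition", "result"]])
```

For the histogram, selecting only the closing rows, for example `df.loc[~df["condition"], "result"].hist()`, should give one value per visit (this needs matplotlib installed).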
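Answer 1 (score: 4)
You need 0.11-dev. I think this will give you what you want. See this section for details: http://pandas.pydata.org/pandas-docs/dev/timeseries.html#time-deltas (timedeltas are a fairly new data type that pandas supports).
Here's your data (I split long/lat into separate columns for convenience; the key point is that the condition column is bool)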
In [137]: df = pd.read_csv(StringIO.StringIO(data),index_col=0,parse_dates=True)
In [138]: df
Out[138]:
date long lat condition
2013-02-05 19:45:00 39.940 -86.159 True
2013-02-05 19:50:00 39.940 -86.159 True
2013-02-05 19:55:00 39.940 -86.159 False
2013-02-05 20:00:00 39.777 -85.995 False
2013-02-05 20:05:00 39.775 -85.978 True
2013-02-05 20:10:00 39.775 -85.978 True
2013-02-05 20:15:00 39.775 -85.978 False
2013-02-05 20:20:00 39.940 -86.159 True
2013-02-05 20:25:00 39.940 -86.159 False
In [139]: df.dtypes
Out[139]:
date float64
long lat float64
condition bool
dtype: object
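The `StringIO.StringIO(data)` call above is Python 2 syntax. Under Python 3, a rough equivalent would use `io.StringIO`. In the sketch below, the CSV text and its column names are stand-ins invented for illustration, since the answer does not show the contents of `data`.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for `data`; the header names are assumptions.
data = """date,long,lat,condition
2013-02-05 19:45:00,39.940,-86.159,True
2013-02-05 19:50:00,39.940,-86.159,True
2013-02-05 19:55:00,39.940,-86.159,False
"""

# The first column becomes a DatetimeIndex; True/False is parsed as bool.
df = pd.read_csv(io.StringIO(data), index_col=0, parse_dates=True)
print(df.dtypes)
```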
Create some date columns from the index (these are datetime64[ns] dtype)
In [140]: df['date'] = df.index
In [141]: df['rdate'] = df.index
Set the rdate column to NaT on the rows where condition is False (np.nan values are converted to NaT)
In [142]: df.loc[~df['condition'],'rdate'] = np.nan
Forward-fill the NaT values from the previous value
In [143]: df['rdate'] = df['rdate'].ffill()
Subtract rdate from date; this produces a timedelta64[ns] column of time differences
In [144]: df['diff'] = df['date']-df['rdate']
In [151]: df
Out[151]:
date long lat condition rdate \
2013-02-05 19:45:00 2013-02-05 19:45:00 -86.159 True 2013-02-05 19:45:00
2013-02-05 19:50:00 2013-02-05 19:50:00 -86.159 True 2013-02-05 19:50:00
2013-02-05 19:55:00 2013-02-05 19:55:00 -86.159 False 2013-02-05 19:50:00
2013-02-05 20:00:00 2013-02-05 20:00:00 -85.995 False 2013-02-05 19:50:00
2013-02-05 20:05:00 2013-02-05 20:05:00 -85.978 True 2013-02-05 20:05:00
2013-02-05 20:10:00 2013-02-05 20:10:00 -85.978 True 2013-02-05 20:10:00
2013-02-05 20:15:00 2013-02-05 20:15:00 -85.978 False 2013-02-05 20:10:00
2013-02-05 20:20:00 2013-02-05 20:20:00 -86.159 True 2013-02-05 20:20:00
2013-02-05 20:25:00 2013-02-05 20:25:00 -86.159 False 2013-02-05 20:20:00
diff
2013-02-05 19:45:00 00:00:00
2013-02-05 19:50:00 00:00:00
2013-02-05 19:55:00 00:05:00
2013-02-05 20:00:00 00:10:00
2013-02-05 20:05:00 00:00:00
2013-02-05 20:10:00 00:00:00
2013-02-05 20:15:00 00:05:00
2013-02-05 20:20:00 00:00:00
2013-02-05 20:25:00 00:05:00
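The three steps above (copy the index, blank out the False rows, forward-fill, subtract) can be sketched as one self-contained script. The DataFrame is rebuilt from the question's sample on an assumed 5-minute grid, and `Series.where` is used here instead of assigning `np.nan` through `.loc`, which is a substitution rather than the answer's exact code.

```python
import pandas as pd

idx = pd.date_range("2013-02-05 19:45", periods=9, freq="5min")
df = pd.DataFrame(
    {"condition": [True, True, False, False, True, True, False, True, False]},
    index=idx,
)

df["date"] = df.index
# Keep the timestamp on True rows; False rows become NaT and then inherit
# the most recent True timestamp through the forward fill.
df["rdate"] = df["date"].where(df["condition"]).ffill()
# Time elapsed since the last True row (timedelta64[ns]).
df["diff"] = df["date"] - df["rdate"]

print(df["diff"])
```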
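The diff column is now timedelta64[ns], so you need to convert it to integer minutes (FYI, this is a bit clunky right now because pandas does not have a scalar Timedelta type analogous to Timestamp for dates)
(Also, you may have to do a shift() on the rdate series before the fill; I think I'm off by one somewhere...) but that's the idea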
In [175]: df['diff'].map(lambda x: x.item().seconds/60)
Out[175]:
2013-02-05 19:45:00 0
2013-02-05 19:50:00 0
2013-02-05 19:55:00 5
2013-02-05 20:00:00 10
2013-02-05 20:05:00 0
2013-02-05 20:10:00 0
2013-02-05 20:15:00 5
2013-02-05 20:20:00 0
2013-02-05 20:25:00 5
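In later pandas releases, timedelta columns expose `.dt.total_seconds()`, which avoids the per-element lambda. The sketch below also sidesteps the off-by-one mentioned above by measuring each visit from its first to its last timestamp. It is only one possible approach: the 5-minute sampling interval and the names `visit`, `span`, and `minutes` are assumptions, and the visit labels are borrowed from the cumsum/shift idea in the other answer.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2013-02-05 19:45", periods=9, freq="5min"),
    "condition": [True, True, False, False, True, True, False, True, False],
})

# Label each visit: a False row ends the visit it belongs to.
visit = (~df["condition"]).cumsum().shift(1, fill_value=0)
by_visit = df.groupby(visit)["date"]

# Elapsed time from the first to the last row of each visit, plus one
# sampling interval (assumed 5 minutes) to count the last row itself.
span = by_visit.transform("max") - by_visit.transform("min")
span = span + pd.Timedelta(minutes=5)

# Timedelta column -> whole minutes.
df["minutes"] = (span.dt.total_seconds() // 60).astype(int)

print(df)
```

Because it works from timestamps rather than row counts, this variant depends on the fixed interval only for the final row of each visit.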