I have some data with timestamps and locations, like this:
A 2013-02-05 19:45:00 (39.94, -86.159)
A 2013-02-05 19:55:00 (39.94, -86.159)
A 2013-02-05 20:00:00 (39.777, -85.995)
A 2013-02-05 20:05:00 (39.775, -85.978)
B 2013-02-05 22:20:00 (39.935, -86.159)
B 2013-02-05 22:25:00 (39.935, -86.159)
B 2013-02-05 23:55:00 (39.951, -86.151)
B 2013-02-06 00:00:00 (39.951, -86.151)
B 2013-02-06 00:05:00 (39.906, -86.196)
C 2013-02-06 00:25:00 (39.82, -86.249)
C 2013-02-06 00:30:00 (39.82, -86.249)
C 2013-02-06 02:45:00 (41.498, -81.527)
C 2013-02-06 02:55:00 (41.498, -81.527)
C 2013-02-06 04:35:00 (39.82, -86.249)
C 2013-02-06 04:40:00 (39.82, -86.249)
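What I need is a histogram, per user and per day, of how many consecutive records a person spends in one place. So I want to label every consecutive period during which a user's location stays the same, for each user and each day.
How can I solve this in Python pandas?
As user C shows, a location can recur for the same user within one day: location (39.82, -86.249) occurs again later. Such cases should be treated as separate consecutive periods.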
Answer 0: (score: 1)
I think you are looking for pd.Series.shift
import pandas as pd
x = pd.Series([1, 3, 3, 2, 3, 3])
x
0 1
1 3
2 3
3 2
4 3
5 3
x.shift(-1)
0 3
1 3
2 2
3 3
4 3
5 NaN
(x != x.shift(-1)).sum()
4
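The sum above only says how many runs there are. To label each run, one common idiom is to compare against the previous element and take a cumulative sum of the change flags. A minimal sketch on the same series; the names `starts`, `run_id` and `run_lengths` are chosen here for illustration and are not part of the original answer:

```python
import pandas as pd

x = pd.Series([1, 3, 3, 2, 3, 3])

# True wherever the value differs from the previous one. The first element
# is compared against NaN, so it always counts as the start of a run.
starts = x != x.shift()

# The cumulative sum of the start flags gives every run its own integer label.
run_id = starts.cumsum()

# Number of elements in each run, indexed by run label.
run_lengths = x.groupby(run_id).size()

print(run_id.tolist())       # [1, 2, 2, 3, 4, 4]
print(run_lengths.tolist())  # [1, 2, 1, 2]
```

Because the second block of 3s gets a different label from the first, a value that recurs later is treated as a separate run.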
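Assuming the data in the question is the output of df[['COL1', 'COL2', 'COL3']],
the following should give you the number of consecutive location runs for each user and day (a location that recurs later in the day is counted again). I'm not sure this is exactly what you want, but it should help you get started
df['DATE'] = df.COL2.apply(lambda s: pd.to_datetime(s).date())
df.groupby(['COL1', 'DATE']).apply(lambda sdf: (sdf.COL3 != sdf.COL3.shift(-1)).sum())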
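For the histogram the question asks for, the length of each run is needed, not only the number of runs. The following is a minimal sketch of one way to get there, not the answerer's own method. It assumes the rows are already sorted by user and time, keeps the COL1/COL2/COL3 names from above, and stores COL3 as a plain string; the `RUN`, `LENGTH` and `RUNS` names are hypothetical:

```python
import pandas as pd

# Rows copied from the question. Storing the location as a string is an
# assumption made for this sketch.
rows = [
    ("A", "2013-02-05 19:45:00", "(39.94, -86.159)"),
    ("A", "2013-02-05 19:55:00", "(39.94, -86.159)"),
    ("A", "2013-02-05 20:00:00", "(39.777, -85.995)"),
    ("A", "2013-02-05 20:05:00", "(39.775, -85.978)"),
    ("B", "2013-02-05 22:20:00", "(39.935, -86.159)"),
    ("B", "2013-02-05 22:25:00", "(39.935, -86.159)"),
    ("B", "2013-02-05 23:55:00", "(39.951, -86.151)"),
    ("B", "2013-02-06 00:00:00", "(39.951, -86.151)"),
    ("B", "2013-02-06 00:05:00", "(39.906, -86.196)"),
    ("C", "2013-02-06 00:25:00", "(39.82, -86.249)"),
    ("C", "2013-02-06 00:30:00", "(39.82, -86.249)"),
    ("C", "2013-02-06 02:45:00", "(41.498, -81.527)"),
    ("C", "2013-02-06 02:55:00", "(41.498, -81.527)"),
    ("C", "2013-02-06 04:35:00", "(39.82, -86.249)"),
    ("C", "2013-02-06 04:40:00", "(39.82, -86.249)"),
]
df = pd.DataFrame(rows, columns=["COL1", "COL2", "COL3"])
df["DATE"] = pd.to_datetime(df["COL2"]).dt.normalize()

# A new run starts whenever the user, the day, or the location differs
# from the previous row.
keys = df[["COL1", "DATE", "COL3"]]
df["RUN"] = (keys != keys.shift()).any(axis=1).cumsum()

# One row per run: how many consecutive records it contains.
run_lengths = (
    df.groupby(["COL1", "DATE", "RUN"]).size().rename("LENGTH").reset_index()
)

# Histogram per user and day: how many runs have each length.
histogram = run_lengths.groupby(["COL1", "DATE", "LENGTH"]).size().rename("RUNS")
print(histogram)
```

With this labelling, user B's stay at (39.951, -86.151) is split at midnight into two runs, since runs are counted per day.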
Answer 1: (score: 0)
Do you mean something like this?
In [5]: df
Out[5]:
0 1 2 3
0 A 2013-02-05 19:45:00 39.940 -86.159
1 A 2013-02-05 19:55:00 39.940 -86.159
2 A 2013-02-05 20:00:00 39.777 -85.995
3 A 2013-02-05 20:05:00 39.775 -85.978
4 B 2013-02-05 22:20:00 39.935 -86.159
5 B 2013-02-05 22:25:00 39.935 -86.159
6 B 2013-02-05 23:55:00 39.951 -86.151
7 B 2013-02-06 00:00:00 39.951 -86.151
8 B 2013-02-06 00:05:00 39.906 -86.196
9 C 2013-02-06 00:25:00 39.820 -86.249
10 C 2013-02-06 00:30:00 39.820 -86.249
11 C 2013-02-06 02:45:00 41.498 -81.527
12 C 2013-02-06 02:55:00 41.498 -81.527
13 C 2013-02-06 04:35:00 39.820 -86.249
14 C 2013-02-06 04:40:00 39.820 -86.249
In [6]: def gb(df, *args, **kwargs):
   ...:     for k, g in df.groupby(*args, **kwargs):
   ...:         # split the group wherever the row index jumps, i.e. the rows are not adjacent
   ...:         splt = np.split(g, np.where(np.diff(g.index.values)!=1)[0]+1)
   ...:         for subg in splt:
   ...:             if len(subg) >=2: yield k, subg
   ...:
In [7]: group_args = [0, df[1].apply(lambda x:x.date()), 2 , 3]
In [8]: for key, grp in gb(df, group_args, sort=False):
   ...:     print(key)
   ...:     print(grp)
   ...:     print('-'*10)
   ...:
Prints:
('A', datetime.date(2013, 2, 5), 39.94, -86.159)
0 1 2 3
0 A 2013-02-05 19:45:00 39.94 -86.159
1 A 2013-02-05 19:55:00 39.94 -86.159
----------
('B', datetime.date(2013, 2, 5), 39.935, -86.159)
0 1 2 3
4 B 2013-02-05 22:20:00 39.935 -86.159
5 B 2013-02-05 22:25:00 39.935 -86.159
----------
('C', datetime.date(2013, 2, 6), 39.82, -86.249)
0 1 2 3
9 C 2013-02-06 00:25:00 39.82 -86.249
10 C 2013-02-06 00:30:00 39.82 -86.249
----------
('C', datetime.date(2013, 2, 6), 39.82, -86.249)
0 1 2 3
13 C 2013-02-06 04:35:00 39.82 -86.249
14 C 2013-02-06 04:40:00 39.82 -86.249
----------
('C', datetime.date(2013, 2, 6), 41.498, -81.527)
0 1 2 3
11 C 2013-02-06 02:45:00 41.498 -81.527
12 C 2013-02-06 02:55:00 41.498 -81.527
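An alternative to splitting each group on index gaps is to label the runs first and let groupby do the splitting. The sketch below is an assumption-laden variant, not the code from this answer: it gives the columns the hypothetical names `user`, `time`, `lat` and `lon` in place of 0 to 3, assumes the rows are sorted by user and time, and summarises each stay in a frame called `stays`:

```python
import pandas as pd

# Rows copied from the question.
data = [
    ("A", "2013-02-05 19:45:00", 39.940, -86.159),
    ("A", "2013-02-05 19:55:00", 39.940, -86.159),
    ("A", "2013-02-05 20:00:00", 39.777, -85.995),
    ("A", "2013-02-05 20:05:00", 39.775, -85.978),
    ("B", "2013-02-05 22:20:00", 39.935, -86.159),
    ("B", "2013-02-05 22:25:00", 39.935, -86.159),
    ("B", "2013-02-05 23:55:00", 39.951, -86.151),
    ("B", "2013-02-06 00:00:00", 39.951, -86.151),
    ("B", "2013-02-06 00:05:00", 39.906, -86.196),
    ("C", "2013-02-06 00:25:00", 39.820, -86.249),
    ("C", "2013-02-06 00:30:00", 39.820, -86.249),
    ("C", "2013-02-06 02:45:00", 41.498, -81.527),
    ("C", "2013-02-06 02:55:00", 41.498, -81.527),
    ("C", "2013-02-06 04:35:00", 39.820, -86.249),
    ("C", "2013-02-06 04:40:00", 39.820, -86.249),
]
df = pd.DataFrame(data, columns=["user", "time", "lat", "lon"])
df["time"] = pd.to_datetime(df["time"])
df["day"] = df["time"].dt.normalize()

# Label runs: a new run starts when user, day, or position changes.
keys = df[["user", "day", "lat", "lon"]]
df["run"] = (keys != keys.shift()).any(axis=1).cumsum()

# One summary row per run.
stays = df.groupby("run").agg(
    user=("user", "first"),
    start=("time", "min"),
    end=("time", "max"),
    lat=("lat", "first"),
    lon=("lon", "first"),
    records=("time", "count"),
)
stays["duration"] = stays["end"] - stays["start"]

# Keep only stays with at least two records, mirroring len(subg) >= 2 above.
stays = stays[stays["records"] >= 2]
print(stays)
```

On the sample data this keeps the same five stays as the printed groups above, listed in time order, so the two visits of user C to (39.82, -86.249) appear as separate rows.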