I have a DataFrame loaded from a CSV file with the following columns: user_id, path, timestamp, gender
| user_id | path | timestamp | gender |
|:-------: |------ |--------------------- |-------- |
| 0 | 1 | 2017-01-01 01:08:56 | f |
| 0 | 2 | 2017-01-01 01:07:56 | f |
| 0 | 3 | 2017-01-01 01:08:40 | f |
| 0 | 4 | 2017-01-01 01:04:36 | f |
| 0 | 5 | 2017-01-01 01:09:53 | f |
| 0 | 6 | 2017-01-01 01:12:33 | f |
| 0 | 7 | 2017-01-01 01:14:12 | f |
| 0 | 8 | 2017-01-01 01:16:25 | f |
| 0 | 9 | 2017-01-01 01:16:56 | f |
| 1 | 1 | 2017-01-01 01:08:56 | m |
| 1 | 2 | 2017-01-01 01:08:06 | m |
| 1 | 3 | 2017-01-01 01:10:51 | m |
| 1 | 4 | 2017-01-01 01:13:53 | m |
| 2 | 1 | 2017-01-01 01:08:56 | f |
| 3 | 2 | 2017-01-01 01:34:56 | m |
The output should look like the following, with one sequence of elements per user:
| paths | timestamps | gender |
|------------------- |--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |-------- |
| 1,2,3,4,5,6,7,8,9 | 2017-01-01 01:08:56, 2017-01-01 01:07:56, 2017-01-01 01:08:40, 2017-01-01 01:04:36, 2017-01-01 01:09:53, 2017-01-01 01:12:33, 2017-01-01 01:14:12, 2017-01-01 01:16:25, 2017-01-01 01:16:56 | f |
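The problem is that the same user_id has multiple rows with different timestamps, and I need a single sequence per user for time series classification (predicting gender from the paths). Also, the timestamps are not unique across the whole DataFrame, but they are unique within each user.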
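To make the target concrete, here is a minimal sketch of the transformation, assuming the columns are named exactly as in the table above and that the timestamps are still plain strings as read from the CSV. The sample rows and the output column names (`paths`, `timestamps`) are only illustrative, and this is one possible approach rather than the definitive one.

```python
import pandas as pd

# Illustrative subset of the rows shown in the table above.
df = pd.DataFrame(
    {
        "user_id": [0, 0, 0, 1, 1, 2],
        "path": [1, 2, 3, 1, 2, 1],
        "timestamp": [
            "2017-01-01 01:08:56",
            "2017-01-01 01:07:56",
            "2017-01-01 01:08:40",
            "2017-01-01 01:08:56",
            "2017-01-01 01:08:06",
            "2017-01-01 01:08:56",
        ],
        "gender": ["f", "f", "f", "m", "m", "f"],
    }
)

# Group on user_id only, so the result gets a flat index, not a MultiIndex.
# Rows keep their original order inside each group.
sequences = (
    df.groupby("user_id", sort=False)
    .agg(
        paths=("path", list),            # all paths of one user as a list
        timestamps=("timestamp", list),  # matching timestamps, same order
        gender=("gender", "first"),      # gender is constant per user
    )
    .reset_index()
)
print(sequences)
```

Each row of `sequences` then holds one user, with the paths and timestamps collected into lists of equal length.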
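I first tried the pandas groupby function with the code below:

    dictionary = {}
    # grouped is the pandas GroupBy object (its definition is not shown);
    # name[0] suggests it was presumably grouped by more than one key
    for name, group in grouped:
        index = name[0]
        if dictionary.get(index, -1) == -1:
            dictionary[index] = {"sequence": group.path.values, "timestamps": group.timestamp.values, "gender": group.gender.values[0]}
        else:
            dictionary[index]["sequence"] = [dictionary[index]["sequence"], group.path.values]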
This doesn't really work: the result keeps a MultiIndex, so I can't get at the values, and I can't extract the values from each group.
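In addition, I tried the following snippet:

    dictionary = {}
    # grouped is the pandas GroupBy object (its definition is not shown);
    # name[0] suggests it was presumably grouped by more than one key
    for name, group in grouped:
        index = name[0]
        if dictionary.get(index, -1) == -1:
            dictionary[index] = {"sequence": group.path.values, "timestamps": group.timestamp.values, "gender": group.gender.values[0]}
        else:
            dictionary[index]["sequence"] = [dictionary[index]["sequence"], group.path.values]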
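For comparison, a hedged sketch of the same dictionary-building loop, assuming the grouping is done on `user_id` alone. With a single scalar key, `name` is the user_id itself rather than a tuple, and each group already contains all rows of that user, so no `else` branch is needed. The sample data is illustrative only.

```python
import pandas as pd

# Illustrative subset of the rows from the question's table.
df = pd.DataFrame(
    {
        "user_id": [0, 0, 0, 1, 1, 2],
        "path": [1, 2, 3, 1, 2, 1],
        "timestamp": [
            "2017-01-01 01:08:56",
            "2017-01-01 01:07:56",
            "2017-01-01 01:08:40",
            "2017-01-01 01:08:56",
            "2017-01-01 01:08:06",
            "2017-01-01 01:08:56",
        ],
        "gender": ["f", "f", "f", "m", "m", "f"],
    }
)

dictionary = {}
# Grouping by one column name (not a list) yields a scalar `name` per group.
for name, group in df.groupby("user_id"):
    dictionary[name] = {
        "sequence": group["path"].tolist(),
        "timestamps": group["timestamp"].tolist(),
        "gender": group["gender"].iloc[0],  # constant within a user
    }
```

After the loop, `dictionary[0]["sequence"]` holds every path of user 0 in row order.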
Thanks for your help!