Suppose I have a series of events that happen on different keys.
import pandas

data = [
{"key": "A", "event": "created"},
{"key": "A", "event": "updated"},
{"key": "A", "event": "updated"},
{"key": "A", "event": "updated"},
{"key": "B", "event": "created"},
{"key": "B", "event": "updated"},
{"key": "B", "event": "updated"},
{"key": "C", "event": "created"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
{"key": "C", "event": "updated"},
]
df = pandas.DataFrame(data)
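I want to index my DataFrame first by key and then by a running count within each key. It looks like a simple unstack operation, but I can't find the right way to do it.
The best I could come up with is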
df.set_index("key", append=True).swaplevel(0, 1)
          event
key
A   0   created
    1   updated
    2   updated
    3   updated
B   4   created
    5   updated
    6   updated
C   7   created
    8   updated
    9   updated
    10  updated
    11  updated
    12  updated
But what I expect is
         event
key
A   0  created
    1  updated
    2  updated
    3  updated
B   0  created
    1  updated
    2  updated
C   0  created
    1  updated
    2  updated
    3  updated
    4  updated
    5  updated
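I also tried
df.groupby("key")["key"].count().apply(range).apply(pandas.Series).stack()
but it does not preserve order, so I can't use the result as an index. Besides, it feels like overkill for what looks like a very standard operation...
Any ideas?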
Answer 0 (score: 6)
Use groupby + cumcount.
Here are a couple of ways:
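# Option 1 (new version, thanks @ScottBoston): pass the cumcount Series directly to set_index
df = df.set_index(['key', df.groupby('key').cumcount()])\
       .rename_axis(['key', 'count'])
# Option 2 (original version): an alternative to option 1; run one or the other, not both
df = df.assign(count=df.groupby('key').cumcount())\
       .set_index(['key', 'count'])
print(df)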
             event
key count
A   0      created
    1      updated
    2      updated
    3      updated
B   0      created
    1      updated
    2      updated
C   0      created
    1      updated
    2      updated
    3      updated
    4      updated
    5      updated
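As a self-contained illustration of the cumcount approach, here is a minimal sketch. The data and the names `events`, `counter` and `indexed` are made up for this example and are not from the answer. The keys are deliberately interleaved to show that cumcount numbers rows within each key in their order of appearance, without needing the rows to be sorted.

```python
import pandas as pd

# Illustrative data with interleaved keys (not the question's data).
events = pd.DataFrame({
    "key":   ["A", "B", "A", "B", "A"],
    "event": ["created", "created", "updated", "updated", "updated"],
})

# cumcount numbers each row within its group, starting at 0,
# following the original row order.
counter = events.groupby("key").cumcount()

# The Series name becomes the name of the second index level.
indexed = events.set_index(["key", counter.rename("count")])

print(counter.tolist())
print(indexed.xs("A", level="key")["event"].tolist())
```

Because the counter is aligned with the original rows, the resulting index stays in the original row order; call sort_index() afterwards if the keys should be grouped together.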
Answer 1 (score: 0)
You can do the following in numpy:
import numpy as np

# df as in the question; assumes rows with the same key are contiguous
keys = df['key'].values
# detect indices where key changes value
change = np.zeros(keys.size, dtype=int)
change[1:] = keys[1:] != keys[:-1]
# naive sequential number
seq = np.arange(keys.size)
# offset by seq at most recent change
offset = np.maximum.accumulate(change * seq)
df['seq'] = seq - offset
print(df.set_index(['key', 'seq']))
           event
key seq
A   0    created
    1    updated
    2    updated
    3    updated
B   0    created
    1    updated
    2    updated
C   0    created
    1    updated
    2    updated
    3    updated
    4    updated
    5    updated
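One caveat worth checking: the numpy version counts positions within runs of consecutive equal keys, so it should only match groupby + cumcount when rows with the same key sit next to each other. The sketch below wraps the same steps in a helper, `run_counter`, which is a hypothetical name introduced here, and compares the two on made-up data.

```python
import numpy as np
import pandas as pd

def run_counter(keys):
    """Position of each element within its run of consecutive equal values."""
    keys = np.asarray(keys)
    # 1 where the key differs from the previous one, 0 elsewhere
    change = np.zeros(keys.size, dtype=int)
    change[1:] = keys[1:] != keys[:-1]
    seq = np.arange(keys.size)
    # index of the most recent change, carried forward
    offset = np.maximum.accumulate(change * seq)
    return seq - offset

# Illustrative inputs (not the question's data).
contiguous = pd.Series(list("AAABBC"))
interleaved = pd.Series(list("ABAB"))

# Contiguous keys: both approaches agree.
print(run_counter(contiguous).tolist())
print(contiguous.groupby(contiguous).cumcount().tolist())

# Interleaved keys: every row starts a new run, so the results differ.
print(run_counter(interleaved).tolist())
print(interleaved.groupby(interleaved).cumcount().tolist())
```

If the keys may be interleaved, sorting by key first (with a stable sort) or using cumcount avoids the discrepancy.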