I have a large pandas DataFrame in the following format:
prod_id timestamp text
150523 0006641040 9.393408e+08 text_1
150500 0006641040 9.408096e+08 text_2
150499 0006641041 1.009325e+09 text_3
150508 0006641041 1.018397e+09 text_4
150524 0006641042 1.025482e+09 text_5
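The DataFrame is sorted by prod_id and timestamp. What I want to do is add a counter that enumerates the rows of each prod_id from the earliest timestamp to the latest. For example, I am trying to get something like this: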
prod_id timestamp text enum
150523 0006641040 9.393408e+08 text_1 1
150500 0006641040 9.408096e+08 text_2 2
150499 0006641041 1.009325e+09 text_3 1
150508 0006641041 1.018397e+09 text_4 2
150524 0006641042 1.025482e+09 text_5 1
I could do this iteratively quite easily by looping over each row and incrementing a counter, but is there a more functional-programming way to do it?
Answer 0 (score: 3)
UPDATE:
In [324]: df
Out[324]:
prod_id timestamp text
150523 6641040 9.393408e+08 text_1
150500 6641040 9.408096e+08 text_2
150501 6641040 9.408096e+08 text_3
150499 6641041 1.009325e+09 text_3
150508 6641041 1.018397e+09 text_4
150524 6641042 1.025482e+09 text_5
In [325]: df['enum'] = df.groupby(['prod_id'])['timestamp'].cumcount() + 1
In [326]: df
Out[326]:
prod_id timestamp text enum
150523 6641040 9.393408e+08 text_1 1
150500 6641040 9.408096e+08 text_2 2
150501 6641040 9.408096e+08 text_3 3
150499 6641041 1.009325e+09 text_3 1
150508 6641041 1.018397e+09 text_4 2
150524 6641042 1.025482e+09 text_5 1
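The session above can be reproduced as a standalone script. This is a minimal sketch: the DataFrame is rebuilt by hand from the printed output, and the explicit `sort_values` call is an added assumption for the case where the frame is not already sorted. Since `cumcount()` numbers rows by their position within each group, the rows have to be in timestamp order first.

```python
import pandas as pd

# Sample data rebuilt by hand from the printed output above.
df = pd.DataFrame(
    {
        "prod_id": [6641040, 6641040, 6641040, 6641041, 6641041, 6641042],
        "timestamp": [9.393408e08, 9.408096e08, 9.408096e08,
                      1.009325e09, 1.018397e09, 1.025482e09],
        "text": ["text_1", "text_2", "text_3", "text_3", "text_4", "text_5"],
    },
    index=[150523, 150500, 150501, 150499, 150508, 150524],
)

# Assumption: sort explicitly in case the frame is not already ordered.
df = df.sort_values(["prod_id", "timestamp"])

# cumcount() is 0-based within each group, so add 1 for a 1-based counter.
df["enum"] = df.groupby("prod_id").cumcount() + 1

print(df)
```

Selecting `['timestamp']` before `cumcount()`, as in the session, gives the same result; the counter depends only on row position inside each group, not on the column values.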
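OLD answer (note that `rank()` averages tied timestamps by default, so duplicate timestamps within a prod_id would not get distinct counters):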
In [314]: df['enum'] = df.groupby(['prod_id'])['timestamp'].rank().astype(int)
In [315]: df
Out[315]:
prod_id timestamp text enum
150523 6641040 9.393408e+08 text_1 1
150500 6641040 9.408096e+08 text_2 2
150499 6641041 1.009325e+09 text_3 1
150508 6641041 1.018397e+09 text_4 2
150524 6641042 1.025482e+09 text_5 1
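The two approaches differ only when a prod_id has duplicate timestamps. The following is a small sketch on made-up data (the `prod_id` values "a" and "b" and the timestamps are hypothetical) comparing the default rank, `rank(method="first")`, and `cumcount()`:

```python
import pandas as pd

# Hypothetical data: group "a" contains a tied timestamp.
df = pd.DataFrame(
    {
        "prod_id": ["a", "a", "a", "b"],
        "timestamp": [1.0, 2.0, 2.0, 5.0],
    }
)

grouped = df.groupby("prod_id")["timestamp"]

# Default method="average": tied rows share the mean of their ranks.
avg_rank = grouped.rank()

# Casting to int truncates 2.5 to 2, so the tied rows get the same number.
avg_rank_int = avg_rank.astype(int)

# method="first" breaks ties by order of appearance.
first_rank = grouped.rank(method="first").astype(int)

# cumcount() numbers rows purely by position within each group.
counter = df.groupby("prod_id").cumcount() + 1

print(pd.DataFrame({
    "avg_rank": avg_rank,
    "avg_rank_int": avg_rank_int,
    "first_rank": first_rank,
    "cumcount": counter,
}))
```

One design difference: `rank(method="first")` orders by the timestamp values themselves and so works on unsorted data, whereas `cumcount()` relies on the rows already being in timestamp order.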