从以下类型的数据框中,我希望能够对日期id
字段进行排序和排名:
df = pd.DataFrame({
'id':[1, 1, 2, 3, 3, 4, 5, 6,6,6,7,7],
'value':[.01, .4, .2, .3, .11, .21, .4, .01, 3, .5, .8, .9],
'date':['10/01/2017 15:45:00','05/01/2017 15:56:00',
'11/01/2017 15:22:00','06/01/2017 11:02:00','05/01/2017 09:37:00',
'05/01/2017 09:55:00','05/01/2017 10:08:00','03/02/2017 08:55:00',
'03/02/2017 09:15:00','03/02/2017 09:31:00','09/01/2017 15:42:00',
'19/01/2017 16:34:00']})
根据日期对每个id
进行有效排名或索引。
我用过
df.groupby('id')['date'].min()
这允许我提取第一个日期(尽管我不知道如何使用它来过滤出行),但是我可能并不总是需要第一个日期-有时它将是第二个或第三个日期,所以我需要生成带有日期索引的新列-结果如下:
关于此排序/排名/标签有什么想法吗?
我的原始模型忽略了一个非常普遍的问题。
由于可能有一些id
对其并行执行了多个测试,因此它们在日期库中显示为多行,并且具有匹配的日期(date
对应于它们记录的时间) 。这些应该算作同一日期,而不要增加date_rank:我已经生成了一个模型,并更新了date_rank
来演示其外观:
df = pd.DataFrame({
'id':[1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6,6,6,7,7],
'value':[.01, .4, .5, .7, .77, .1,.2, 0.3, .11, .21, .4, .01, 3, .5, .8, .9, .1],
'date':['10/01/2017 15:45:00','10/01/2017 15:45:00','05/01/2017 15:56:00',
'11/01/2017 15:22:00','11/01/2017 15:22:00','06/01/2017 11:02:00','05/01/2017 09:37:00','05/01/2017 09:37:00','05/01/2017 09:55:00',
'05/01/2017 09:55:00','05/01/2017 10:08:00','05/01/2017 10:09:00','03/02/2017 08:55:00',
'03/02/2017 09:15:00','03/02/2017 09:31:00','09/01/2017 15:42:00',
'19/01/2017 16:34:00']})
柜台会负担得起:
答案 0 :(得分:2)
您可以通过sort_values,groupby和cumcount
来完成此操作df['date_rank'] = df.sort_values(['id', 'date'], ascending=[True, False]).groupby(['id']).cumcount() + 1
演示
In [1]: df = pd.DataFrame({
...: 'id':[1, 1, 2, 3, 3, 4, 5, 6,6,6,7,7],
...: 'value':[.01, .4, .2, .3, .11, .21, .4, .01, 3, .5, .8, .9],
...: 'date':['10/01/2017 15:45:00','05/01/2017 15:56:00',
...: '11/01/2017 15:22:00','06/01/2017 11:02:00','05/01/2017 09:37:00',
...: '05/01/2017 09:55:00','05/01/2017 10:08:00','03/02/2017 08:55:00',
...: '03/02/2017 09:15:00','03/02/2017 09:31:00','09/01/2017 15:42:00',
...: '19/01/2017 16:34:00']})
...:
In [2]: df['date_rank'] = df.sort_values(['id', 'date'], ascending=[True, False]).groupby(['id']).cumcount() + 1
...:
In [3]: df
Out[3]:
id value date date_rank
0 1 0.01 10/01/2017 15:45:00 1
1 1 0.40 05/01/2017 15:56:00 2
2 2 0.20 11/01/2017 15:22:00 1
3 3 0.30 06/01/2017 11:02:00 1
4 3 0.11 05/01/2017 09:37:00 2
5 4 0.21 05/01/2017 09:55:00 1
6 5 0.40 05/01/2017 10:08:00 1
7 6 0.01 03/02/2017 08:55:00 3
8 6 3.00 03/02/2017 09:15:00 2
9 6 0.50 03/02/2017 09:31:00 1
10 7 0.80 09/01/2017 15:42:00 2
11 7 0.90 19/01/2017 16:34:00 1
修改
您可以通过等级方法做到这一点
df.groupby(['id'])['date'].rank(ascending=False, method='dense').astype(int)
演示
In [1]: df['rank'] = df.groupby(['id'])['date'].rank(ascending=False, method='dense').astype(int)
In [2]: df
Out[2]:
id value date rank
0 1 0.01 2017-10-01 15:45:00 1
1 1 0.40 2017-10-01 15:45:00 1
2 1 0.50 2017-05-01 15:56:00 2
3 2 0.70 2017-11-01 15:22:00 1
4 2 0.77 2017-11-01 15:22:00 1
5 3 0.10 2017-06-01 11:02:00 1
6 3 0.20 2017-05-01 09:37:00 2
7 3 0.30 2017-05-01 09:37:00 2
8 4 0.11 2017-05-01 09:55:00 1
9 4 0.21 2017-05-01 09:55:00 1
10 5 0.40 2017-05-01 10:08:00 2
11 5 0.01 2017-05-01 10:09:00 1
12 6 3.00 2017-03-02 08:55:00 3
13 6 0.50 2017-03-02 09:15:00 2
14 6 0.80 2017-03-02 09:31:00 1
15 7 0.90 2017-09-01 15:42:00 1
16 7 0.10 2017-01-19 16:34:00 2
答案 1 :(得分:1)
您可以尝试按降序和汇总“ id”组值的顺序对日期值进行排序
@praveen的逻辑非常简单,通过扩展逻辑,您可以使用类别的类型将值转换为类别,并可以检索该类别的代码(键),但是与您的预期有点不同输出
df1 = df.sort_values(['id', 'date'], ascending=[True, False])
df1['date_rank'] =df1.groupby(['id']).apply(lambda x: x['date'].astype('category',ordered=False).cat.codes+1).values
出局:
date id value date_rank
0 10/01/2017 15:45:00 1 0.01 2
1 10/01/2017 15:45:00 1 0.40 2
2 05/01/2017 15:56:00 1 0.50 1
3 11/01/2017 15:22:00 2 0.70 1
4 11/01/2017 15:22:00 2 0.77 1
5 06/01/2017 11:02:00 3 0.10 2
6 05/01/2017 09:37:00 3 0.20 1
7 05/01/2017 09:37:00 3 0.30 1
8 05/01/2017 09:55:00 4 0.11 1
9 05/01/2017 09:55:00 4 0.21 1
11 05/01/2017 10:09:00 5 0.01 2
10 05/01/2017 10:08:00 5 0.40 1
14 03/02/2017 09:31:00 6 0.80 3
13 03/02/2017 09:15:00 6 0.50 2
12 03/02/2017 08:55:00 6 3.00 1
16 19/01/2017 16:34:00 7 0.10 2
15 09/01/2017 15:42:00 7 0.90 1
但是为了获得您的准确输出,在这里我使用了字典和反转字典键以及提取值
df1 = df.sort_values(['id', 'date'], ascending=[True, False])
df1['date_rank'] = df1.groupby(['id'])['date'].transform(lambda x: list(map(lambda y: dict(map(reversed, dict(enumerate(x.unique())).items()))[y]+1,x)) )
出局:
date id value date_rank
0 10/01/2017 15:45:00 1 0.01 1
1 10/01/2017 15:45:00 1 0.40 1
2 05/01/2017 15:56:00 1 0.50 2
3 11/01/2017 15:22:00 2 0.70 1
4 11/01/2017 15:22:00 2 0.77 1
5 06/01/2017 11:02:00 3 0.10 1
6 05/01/2017 09:37:00 3 0.20 2
7 05/01/2017 09:37:00 3 0.30 2
8 05/01/2017 09:55:00 4 0.11 1
9 05/01/2017 09:55:00 4 0.21 1
11 05/01/2017 10:09:00 5 0.01 1
10 05/01/2017 10:08:00 5 0.40 2
14 03/02/2017 09:31:00 6 0.80 1
13 03/02/2017 09:15:00 6 0.50 2
12 03/02/2017 08:55:00 6 3.00 3
16 19/01/2017 16:34:00 7 0.10 1
15 09/01/2017 15:42:00 7 0.90 2