I have a DataFrame, df:
param per per_date per_num
0 XYZ 1.0 2018-10-01 11.0
1 XYZ 2.0 2017-08-01 15.25
2 XYZ 1.0 2019-10-01 11.25
3 XYZ 2.0 2019-08-01 15.71
4 XYZ 3.0 2020-10-01 NaN
5 MMG 1.0 2021-10-01 12.50
6 MMG 2.0 2021-10-01 11.75
7 MMG 3.0 2011-01-01 NaN
8 ZZZ 4.0 2023-01-01 19.00
9 ZZZ 3.0 2014-01-01 13.00
10 MMM 1.0 2016-03-01 12.01
11 MMM 2.0 2019-01-01 16.00
12 ZZZ 1.0 2009-06-01 12.50
13 ZZZ 2.0 2018-01-01 19.00
I need output like this:
param per per_date per_num
0 MMG 1.0 2021-10-01 12.50
1 MMG 2.0 2021-10-01 11.75
2 MMG 3.0 2011-01-01 NaN
3 MMM 1.0 2016-03-01 12.01
4 MMM 2.0 2019-01-01 16.00
5 XYZ 1.0 2018-10-01 11.0
6 XYZ 2.0 2017-08-01 15.25
7 XYZ 1.0 2019-10-01 11.25
8 XYZ 2.0 2019-08-01 15.71
9 XYZ 3.0 2020-10-01 NaN
10 ZZZ 1.0 2009-06-01 12.50
11 ZZZ 2.0 2018-01-01 19.00
12 ZZZ 4.0 2023-01-01 19.00
13 ZZZ 3.0 2014-01-01 13.00
However, when I sort it,
df = df.sort_values(['param', 'per']).reset_index(drop=True)
df
I get this (not what I want):
param per per_date per_num
0 MMG 1.0 2021-10-01 12.50
1 MMG 2.0 2021-10-01 11.75
2 MMG 3.0 2011-01-01 NaN
3 MMM 1.0 2016-03-01 12.01
4 MMM 2.0 2019-01-01 16.00
5 XYZ 1.0 2018-10-01 11.00
6 XYZ 1.0 2019-10-01 11.25
7 XYZ 2.0 2017-08-01 15.25
8 XYZ 2.0 2019-08-01 15.71
9 XYZ 3.0 2020-10-01 NaN
10 ZZZ 1.0 2009-06-01 12.50
11 ZZZ 2.0 2018-01-01 19.00
12 ZZZ 3.0 2014-01-01 13.00
13 ZZZ 4.0 2023-01-01 19.00
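If you look at the original df above, the per values for XYZ run 1, 2 and then start again with 1, 2, 3, so they are two different categories, and I want to keep them as they are. However, where the per values all belong to one sequence, and therefore one category, but are out of order, they need to be sorted. How can I do this in pandas?
Any suggestions are appreciated.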
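To make the question reproducible, here is a minimal sketch that rebuilds the sample frame and shows what the plain sort does to XYZ. It keeps only the param and per columns, on the assumption that the dates and numbers play no part in the ordering; the variable names plain and xyz_per are illustrative.

```python
import pandas as pd

# Only the two columns that drive the ordering, copied from the table above.
df = pd.DataFrame({
    "param": ["XYZ"] * 5 + ["MMG"] * 3 + ["ZZZ"] * 2 + ["MMM"] * 2 + ["ZZZ"] * 2,
    "per": [1.0, 2.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 4.0, 3.0, 1.0, 2.0, 1.0, 2.0],
})

plain = df.sort_values(["param", "per"]).reset_index(drop=True)

# The two XYZ runs (1, 2 and 1, 2, 3) end up merged into one sequence.
xyz_per = plain.loc[plain["param"] == "XYZ", "per"].tolist()
print(xyz_per)
```

The printed list interleaves the two runs, which is the behaviour the question is trying to avoid.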
Answer 0 (score: 1)
We can use Categorical:
df.param = pd.Categorical(df.param, categories = df.param.unique())
df = df.sort_values(['param', 'per']).reset_index(drop = True)
df
Out[348]:
param per per_date per_num
0 XYZ 1.0 2018-10-01 11.00
1 XYZ 1.0 2019-10-01 11.25
2 XYZ 2.0 2017-08-01 15.25
3 XYZ 2.0 2019-08-01 15.71
4 XYZ 3.0 2020-10-01 NaN
5 MMG 1.0 2021-10-01 12.50
6 MMG 2.0 2021-10-01 11.75
7 MMG 3.0 2011-01-01 NaN
8 ZZZ 1.0 2009-06-01 12.50
9 ZZZ 2.0 2018-01-01 19.00
10 ZZZ 3.0 2014-01-01 13.00
11 ZZZ 4.0 2023-01-01 19.00
12 MMM 1.0 2016-03-01 12.01
13 MMM 2.0 2019-01-01 16.00
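As a rough illustration of why the params come out in that order, the sketch below applies the same Categorical conversion to a short, made-up Series (params is an assumption, not the question's data). Sorting a categorical follows the category order, and unique() lists values in order of first appearance.

```python
import pandas as pd

params = pd.Series(["XYZ", "MMG", "ZZZ", "MMM", "ZZZ", "XYZ"])

# unique() returns values in order of first appearance, so the category
# order becomes XYZ, MMG, ZZZ, MMM rather than alphabetical.
cat = pd.Series(pd.Categorical(params, categories=params.unique()))

# Sorting a categorical Series follows the category order, not the alphabet.
sorted_params = cat.sort_values().tolist()
print(sorted_params)
```

If alphabetical param order is wanted, as in the desired output in the question, the Categorical conversion can simply be skipped.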
Update:
df.param = pd.Categorical(df.param, categories = df.param.unique())
df['Key']=df.groupby(['param','per']).cumcount()
df = df.sort_values(['param', 'Key', 'per']).reset_index(drop=True).drop(columns='Key')
df
Out[375]:
param per per_date per_num
0 XYZ 1.0 2018-10-01 11.00
1 XYZ 2.0 2017-08-01 15.25
2 XYZ 3.0 2020-10-01 NaN
3 XYZ 1.0 2019-10-01 11.25
4 XYZ 2.0 2019-08-01 15.71
5 MMG 1.0 2021-10-01 12.50
6 MMG 2.0 2021-10-01 11.75
7 MMG 3.0 2011-01-01 NaN
8 ZZZ 1.0 2009-06-01 12.50
9 ZZZ 2.0 2018-01-01 19.00
10 ZZZ 3.0 2014-01-01 13.00
11 ZZZ 4.0 2023-01-01 19.00
12 MMM 1.0 2016-03-01 12.01
13 MMM 2.0 2019-01-01 16.00
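To see how the helper column drives this result, here is a small sketch on the XYZ rows alone (the cut-down frame xyz is an assumption made for brevity). cumcount numbers each repeated (param, per) pair, and sorting on that number before per keeps each pass through the per values together.

```python
import pandas as pd

# The XYZ rows from the question, in their original order.
xyz = pd.DataFrame({"param": ["XYZ"] * 5, "per": [1.0, 2.0, 1.0, 2.0, 3.0]})

# cumcount numbers repeated (param, per) pairs: first occurrence 0, second 1, ...
xyz["Key"] = xyz.groupby(["param", "per"]).cumcount()
keys = xyz["Key"].tolist()

# Sorting on Key before per groups rows by occurrence number first.
ordered = xyz.sort_values(["param", "Key", "per"])["per"].tolist()
print(keys, ordered)
```

Because per 3.0 occurs only once, it gets Key 0 and joins the first pass, which is why the output above reads 1, 2, 3 and then 1, 2 for XYZ.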
Answer 1 (score: 1)
Update:
df.assign(sortkey=df.groupby('param')
                    .apply(lambda x: x.duplicated(['param', 'per']).cumsum())
                    .reset_index(level=0, drop=True))\
  .sort_values(['param', 'sortkey', 'per'])
Output:
param per per_date per_num sortkey
5 MMG 1.0 2021-10-01 12.50 0
6 MMG 2.0 2021-10-01 11.75 0
7 MMG 3.0 2011-01-01 NaN 0
10 MMM 1.0 2016-03-01 12.01 0
11 MMM 2.0 2019-01-01 16.00 0
0 XYZ 1.0 2018-10-01 11.00 0
1 XYZ 2.0 2017-08-01 15.25 0
2 XYZ 1.0 2019-10-01 11.25 1
3 XYZ 2.0 2019-08-01 15.71 2
4 XYZ 3.0 2020-10-01 NaN 2
12 ZZZ 1.0 2009-06-01 12.50 0
13 ZZZ 2.0 2018-01-01 19.00 0
9 ZZZ 3.0 2014-01-01 13.00 0
8 ZZZ 4.0 2023-01-01 19.00 0
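The same sort key can probably be built without groupby.apply, which may be convenient because the way apply treats the grouping columns has changed across pandas versions. The sketch below is an alternative formulation, not the answerer's code, and it runs on a cut-down frame (XYZ and two ZZZ rows) chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "param": ["XYZ"] * 5 + ["ZZZ"] * 2,
    "per": [1.0, 2.0, 1.0, 2.0, 3.0, 4.0, 3.0],
})

# Flag rows whose (param, per) pair was already seen, then count the flags
# cumulatively within each param.
is_repeat = df.duplicated(["param", "per"]).astype(int)
sortkey = is_repeat.groupby(df["param"]).cumsum()

result = df.assign(sortkey=sortkey).sort_values(["param", "sortkey", "per"])
print(result)
```

Since param is part of the duplicated subset, flagging duplicates on the whole frame gives the same flags as flagging them group by group.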
Answer 2 (score: 1)
How about this?
df.assign(sortkey=-df.groupby(['param','per']).cumcount()).sort_values(['param','sortkey','per']).reset_index(drop=True)
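To see what the negated key does, here is a small sketch on the XYZ rows alone (the cut-down frame xyz is an assumption for illustration). Repeated (param, per) pairs get -1 and therefore sort ahead of the first occurrences, which get 0.

```python
import pandas as pd

xyz = pd.DataFrame({"param": ["XYZ"] * 5, "per": [1.0, 2.0, 1.0, 2.0, 3.0]})

# Negated cumcount: first occurrences get 0, second occurrences get -1.
sortkey = -xyz.groupby(["param", "per"]).cumcount()

ordered = xyz.assign(sortkey=sortkey).sort_values(["param", "sortkey", "per"])
print(ordered)
```

The per column then reads 1, 2, 1, 2, 3, matching the pattern in the desired output, but the rows of the second pass (original index 2 and 3) now come before those of the first.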
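Answer 3 (score: 0)
You first need to group by param and per, and assign each occurrence of a per value to a grouper. Then group again by param and grouper, and sort each group by param and per.
import pandas as pd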
from pandas import Timestamp
df = pd.DataFrame([['XYZ', 1.0, Timestamp('2018-10-01 00:00:00'), 11.0], ['XYZ', 2.0, Timestamp('2017-08-01 00:00:00'), 15.25], ['XYZ', 1.0, Timestamp('2019-10-01 00:00:00'), 11.25], ['XYZ', 2.0, Timestamp('2019-08-01 00:00:00'), 15.71], ['MMG', 1.0, Timestamp('2021-10-01 00:00:00'), 12.5], ['MMG', 2.0, Timestamp('2021-10-01 00:00:00'), 11.75], ['ZZZ', 4.0, Timestamp('2023-01-01 00:00:00'), 19.0], ['ZZZ', 3.0, Timestamp('2014-01-01 00:00:00'), 13.0], ['MMM', 1.0, Timestamp('2016-03-01 00:00:00'), 12.01], ['MMM', 2.0, Timestamp('2019-01-01 00:00:00'), 16.0], ['ZZZ', 1.0, Timestamp('2009-06-01 00:00:00'), 12.5], ['ZZZ', 2.0, Timestamp('2018-01-01 00:00:00'), 19.0]], columns=('param', 'per', 'per_date', 'per_num'))
df["grouper"] = df.groupby(["param", "per"]).cumcount()
df.groupby(["param", "grouper"])\
.apply(lambda g: g.sort_values(["param", "per"]))\
.reset_index(drop=True)
Result:
param per per_date per_num grouper
0 MMG 1.0 2021-10-01 12.50 0
1 MMG 2.0 2021-10-01 11.75 0
2 MMM 1.0 2016-03-01 12.01 0
3 MMM 2.0 2019-01-01 16.00 0
4 XYZ 1.0 2018-10-01 11.00 0
5 XYZ 2.0 2017-08-01 15.25 0
6 XYZ 1.0 2019-10-01 11.25 1
7 XYZ 2.0 2019-08-01 15.71 1
8 ZZZ 1.0 2009-06-01 12.50 0
9 ZZZ 2.0 2018-01-01 19.00 0
10 ZZZ 3.0 2014-01-01 13.00 0
11 ZZZ 4.0 2023-01-01 19.00 0