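I have a DataFrame whose index holds dates (the same date appears many times). For each date there are columns such as price, score, category, and so on.
I want to add a new column named pctrank to the DataFrame.
In the pctrank column, I want the percentile rank of each score, computed within each category at each index level (date). For example, in the data below for 1/24/2017, I would percent-rank all SuperMarket scores, separately percent-rank all Resteraunt scores for that date, and then move on to the next date.
Since the dataset is large, I would like the solution to be reasonably efficient.
**Sample data below**
Subset of df: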
Category SCORE
1/24/2017 SuperMarket 12
1/24/2017 Resteraunt 21
1/24/2017 SuperMarket 13
1/24/2017 SuperMarket 22
1/24/2017 Resteraunt 27
1/24/2017 SuperMarket 30
1/24/2017 Resteraunt 34
1/24/2017 Resteraunt 32
1/24/2017 Resteraunt 21
1/24/2017 Resteraunt 12
1/24/2017 Bar 10
1/24/2017 Bar 3
1/24/2017 Bar 24
1/25/2017 Resteraunt 32
1/25/2017 Resteraunt 63
1/25/2017 Resteraunt 32
1/25/2017 Bar 12
1/25/2017 Bar 32
1/25/2017 Hospital 22
1/25/2017 Hospital 12
1/25/2017 Bar 10
Example output:
Category SCORE Percnt rank
1/24/2017 SuperMarket 12 0
1/24/2017 Resteraunt 21 0.2
1/24/2017 SuperMarket 13 0.333
1/24/2017 SuperMarket 22 0.666
1/24/2017 Resteraunt 27 0.6
1/24/2017 SuperMarket 30 1
1/24/2017 Resteraunt 34 1
1/24/2017 Resteraunt 32 0.8
1/24/2017 Resteraunt 21 0.2
1/24/2017 Resteraunt 12 0
1/24/2017 Bar 10 0.5
1/24/2017 Bar 3 0
1/24/2017 Bar 24 1
1/25/2017 Resteraunt 32 0
1/25/2017 Resteraunt 63 1
1/25/2017 Resteraunt 32 0
1/25/2017 Bar 12 0.5
1/25/2017 Bar 32 1
1/25/2017 Hospital 22 1
1/25/2017 Hospital 12 0
1/25/2017 Bar 10 0
The real dataset contains a large number of dates and corresponding entries.
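For anyone who wants to reproduce the problem, the sample above could be rebuilt roughly as follows. This is only a sketch: the dates are kept as plain strings in the index rather than parsed timestamps, and the helper name `rows` is introduced here for illustration.

```python
import pandas as pd

# (date, category, score) triples copied from the sample table above.
rows = [
    ("1/24/2017", "SuperMarket", 12), ("1/24/2017", "Resteraunt", 21),
    ("1/24/2017", "SuperMarket", 13), ("1/24/2017", "SuperMarket", 22),
    ("1/24/2017", "Resteraunt", 27), ("1/24/2017", "SuperMarket", 30),
    ("1/24/2017", "Resteraunt", 34), ("1/24/2017", "Resteraunt", 32),
    ("1/24/2017", "Resteraunt", 21), ("1/24/2017", "Resteraunt", 12),
    ("1/24/2017", "Bar", 10), ("1/24/2017", "Bar", 3),
    ("1/24/2017", "Bar", 24), ("1/25/2017", "Resteraunt", 32),
    ("1/25/2017", "Resteraunt", 63), ("1/25/2017", "Resteraunt", 32),
    ("1/25/2017", "Bar", 12), ("1/25/2017", "Bar", 32),
    ("1/25/2017", "Hospital", 22), ("1/25/2017", "Hospital", 12),
    ("1/25/2017", "Bar", 10),
]

# The date goes into an unnamed index with repeated labels,
# mirroring the layout described in the question.
df = pd.DataFrame(
    [(category, score) for _, category, score in rows],
    index=[date for date, _, _ in rows],
    columns=["Category", "SCORE"],
)
print(df.head())
```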
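Answer 0 (score: 1)
You can use groupby with rank and divide by nunique. To make the result start from 0, subtract 1 from both the rank and the unique count: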
df['Percnt rank'] = df.reset_index() \
.groupby(['index','Category'])['SCORE'] \
.apply(lambda x: (x.rank(method='dense') - 1) / (x.nunique() - 1) ) \
.values
print (df)
Category SCORE Percnt rank
1/24/2017 SuperMarket 12 0.000000
1/24/2017 Resteraunt 21 0.250000
1/24/2017 SuperMarket 13 0.333333
1/24/2017 SuperMarket 22 0.666667
1/24/2017 Resteraunt 27 0.500000
1/24/2017 SuperMarket 30 1.000000
1/24/2017 Resteraunt 34 1.000000
1/24/2017 Resteraunt 32 0.750000
1/24/2017 Resteraunt 21 0.250000
1/24/2017 Resteraunt 12 0.000000
1/24/2017 Bar 10 0.500000
1/24/2017 Bar 3 0.000000
1/24/2017 Bar 24 1.000000
1/25/2017 Resteraunt 32 0.000000
1/25/2017 Resteraunt 63 1.000000
1/25/2017 Resteraunt 32 0.000000
1/25/2017 Bar 12 0.500000
1/25/2017 Bar 32 1.000000
1/25/2017 Hospital 22 1.000000
1/25/2017 Hospital 12 0.000000
1/25/2017 Bar 10 0.000000
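The same calculation can also be sketched without `apply`, using the group-wise `rank` and `transform("nunique")`, both of which return results in the original row order. Grouping on `df.index` directly avoids the `reset_index()` step. The guard for groups with a single distinct score (where `nunique - 1` would be zero) is an addition not present in the answer, and returning 0 for those groups is an assumption.

```python
import pandas as pd

# Small subset of the sample data: one Resteraunt group with a tie,
# plus a single-row Hospital group to exercise the zero-denominator guard.
df = pd.DataFrame(
    {
        "Category": ["Resteraunt"] * 6 + ["Hospital"],
        "SCORE": [21, 27, 34, 32, 21, 12, 22],
    },
    index=["1/24/2017"] * 6 + ["1/25/2017"],
)

# Group by the date index and the Category column together.
grouped = df.groupby([df.index, "Category"])["SCORE"]

dense = grouped.rank(method="dense")      # 1-based dense rank per group
n_unique = grouped.transform("nunique")   # distinct scores per group, one value per row

# NaN out the denominator where a group has only one distinct score,
# then map those rows to 0 (an assumption, see the lead-in).
denom = (n_unique - 1).where(n_unique > 1)
df["Percnt rank"] = ((dense - 1) / denom).fillna(0.0).to_numpy()

print(df)
```

Because both intermediate Series keep the original row order, assigning positionally with `to_numpy()` does not depend on how the groups are sorted.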
If you use rank with pct=True directly instead, the output is different:
df['Percnt rank'] = df.reset_index()\
.groupby(['index','Category'])['SCORE'].rank(method='dense', pct=True)\
.values
print (df)
Category SCORE Percnt rank
1/24/2017 SuperMarket 12 0.250000
1/24/2017 Resteraunt 21 0.333333
1/24/2017 SuperMarket 13 0.500000
1/24/2017 SuperMarket 22 0.750000
1/24/2017 Resteraunt 27 0.500000
1/24/2017 SuperMarket 30 1.000000
1/24/2017 Resteraunt 34 0.833333
1/24/2017 Resteraunt 32 0.666667
1/24/2017 Resteraunt 21 0.333333
1/24/2017 Resteraunt 12 0.166667
1/24/2017 Bar 10 0.666667
1/24/2017 Bar 3 0.333333
1/24/2017 Bar 24 1.000000
1/25/2017 Resteraunt 32 0.333333
1/25/2017 Resteraunt 63 0.666667
1/25/2017 Resteraunt 32 0.333333
1/25/2017 Bar 12 0.666667
1/25/2017 Bar 32 1.000000
1/25/2017 Hospital 22 1.000000
1/25/2017 Hospital 12 0.500000
1/25/2017 Bar 10 0.333333
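Neither dense variant reproduces values such as 0.2, 0.6 and 0.8 from the sample output in the question. Those values appear consistent with dividing the number of strictly smaller scores in the group by the group size minus one. A minimal sketch under that assumption follows; the formula is inferred from the sample output, not stated by the asker, and single-row groups are mapped to 0 as a further assumption.

```python
import pandas as pd

# Resteraunt and Bar rows for 1/24/2017, copied from the sample data.
df = pd.DataFrame(
    {
        "Category": ["Resteraunt"] * 6 + ["Bar"] * 3,
        "SCORE": [21, 27, 34, 32, 21, 12, 10, 3, 24],
    },
    index=["1/24/2017"] * 9,
)

grouped = df.groupby([df.index, "Category"])["SCORE"]

# With method="min", tied scores share the lowest rank, so rank - 1
# equals the number of strictly smaller scores in the same group.
smaller = grouped.rank(method="min") - 1
size = grouped.transform("count")         # non-missing scores per group

# Avoid dividing by zero for groups that contain a single score.
denom = (size - 1).where(size > 1)
df["pctrank"] = (smaller / denom).fillna(0.0).to_numpy()

print(df)
```

On these rows the sketch gives 0.2 for both tied Resteraunt scores of 21 and 0.5 for the Bar score of 10, in line with the sample output.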
Answer 1 (score: 1)
Using a custom function, I calculate rank(method='dense', pct=True) with the minimum values excluded, then fill those rows in with 0:
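def prank(s):
    # Rows equal to the group minimum are left out of the ranking.
    mask = s.values != s.values.min()
    # Float dtype so the unranked rows start out as NaN.
    r = pd.Series(index=s.index, dtype=float)
    r.loc[mask] = s.loc[mask].rank(method='dense', pct=True)
    # The minimum rows were never ranked; give them 0.
    return r.fillna(0)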
df.assign(**{'Percent rank': df.reset_index().groupby(['index', 'Category']).SCORE.apply(prank).values})
Category SCORE Percent rank
1/24/2017 SuperMarket 12 0.000000
1/24/2017 Resteraunt 21 0.200000
1/24/2017 SuperMarket 13 0.333333
1/24/2017 SuperMarket 22 0.666667
1/24/2017 Resteraunt 27 0.400000
1/24/2017 SuperMarket 30 1.000000
1/24/2017 Resteraunt 34 0.800000
1/24/2017 Resteraunt 32 0.600000
1/24/2017 Resteraunt 21 0.200000
1/24/2017 Resteraunt 12 0.000000
1/24/2017 Bar 10 0.500000
1/24/2017 Bar 3 0.000000
1/24/2017 Bar 24 1.000000
1/25/2017 Resteraunt 32 0.000000
1/25/2017 Resteraunt 63 1.000000
1/25/2017 Resteraunt 32 0.500000
1/25/2017 Bar 12 0.500000
1/25/2017 Bar 32 1.000000
1/25/2017 Hospital 22 1.000000
1/25/2017 Hospital 12 0.000000
1/25/2017 Bar 10 0.000000