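Pandas percentile rank based on groups within each index value

Time: 2017-01-26 06:26:22

Tags: python pandas group-by percentile

I have a DataFrame whose index holds dates (the same date appears many times). For each date there are columns such as price, score, category, and so on.

I would like to add a new column to the DataFrame called pctrank.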

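In the pctrank column, I want to compute the percentile rank of the score within each category at each index level. For example, in the data below for 1/24/2017, I would percent rank all the scores for SuperMarket, separately percent rank all the scores for Resteraunt on that date, and then move on to the next date.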

Since the dataset is large, I would like this to be reasonably efficient.

**Sample data below**

Subset of df:

            Category    SCORE
1/24/2017   SuperMarket 12
1/24/2017   Resteraunt  21
1/24/2017   SuperMarket 13
1/24/2017   SuperMarket 22
1/24/2017   Resteraunt  27
1/24/2017   SuperMarket 30
1/24/2017   Resteraunt  34
1/24/2017   Resteraunt  32
1/24/2017   Resteraunt  21
1/24/2017   Resteraunt  12
1/24/2017   Bar         10
1/24/2017   Bar          3
1/24/2017   Bar         24
1/25/2017   Resteraunt  32
1/25/2017   Resteraunt  63
1/25/2017   Resteraunt  32
1/25/2017   Bar         12
1/25/2017   Bar         32
1/25/2017   Hospital    22
1/25/2017   Hospital    12
1/25/2017   Bar         10

Sample output:

            Category    SCORE   Percnt rank
1/24/2017   SuperMarket 12      0
1/24/2017   Resteraunt  21      0.2
1/24/2017   SuperMarket 13      0.333
1/24/2017   SuperMarket 22      0.666
1/24/2017   Resteraunt  27      0.6
1/24/2017   SuperMarket 30      1
1/24/2017   Resteraunt  34      1
1/24/2017   Resteraunt  32      0.8
1/24/2017   Resteraunt  21      0.2
1/24/2017   Resteraunt  12      0
1/24/2017   Bar         10      0.5
1/24/2017   Bar          3      0
1/24/2017   Bar         24      1
1/25/2017   Resteraunt  32      0
1/25/2017   Resteraunt  63      1
1/25/2017   Resteraunt  32      0
1/25/2017   Bar         12      0.5
1/25/2017   Bar         32      1
1/25/2017   Hospital    22      1
1/25/2017   Hospital    12      0
1/25/2017   Bar         10      0

The real dataset contains a large number of dates and corresponding entries.

2 answers:

Answer 0 (score: 1)

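You can use groupby with rank and divide by nunique. For the result to start from 0, you must subtract 1 from both the rank and the number of unique values.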

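# reset_index() exposes the unnamed date index as a column called 'index'.
# transform (rather than apply) returns the result in the original row order
# on all pandas versions, so .values lines up with the rows of df.
# Note: a group with a single distinct score gives 0 / 0 = NaN.
df['Percnt rank'] = df.reset_index() \
                      .groupby(['index','Category'])['SCORE'] \
                      .transform(lambda x: (x.rank(method='dense') - 1) / (x.nunique() - 1) ) \
                      .values
print (df)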

              Category  SCORE  Percnt rank
1/24/2017  SuperMarket     12     0.000000
1/24/2017   Resteraunt     21     0.250000
1/24/2017  SuperMarket     13     0.333333
1/24/2017  SuperMarket     22     0.666667
1/24/2017   Resteraunt     27     0.500000
1/24/2017  SuperMarket     30     1.000000
1/24/2017   Resteraunt     34     1.000000
1/24/2017   Resteraunt     32     0.750000
1/24/2017   Resteraunt     21     0.250000
1/24/2017   Resteraunt     12     0.000000
1/24/2017          Bar     10     0.500000
1/24/2017          Bar      3     0.000000
1/24/2017          Bar     24     1.000000
1/25/2017   Resteraunt     32     0.000000
1/25/2017   Resteraunt     63     1.000000
1/25/2017   Resteraunt     32     0.000000
1/25/2017          Bar     12     0.500000
1/25/2017          Bar     32     1.000000
1/25/2017     Hospital     22     1.000000
1/25/2017     Hospital     12     0.000000
1/25/2017          Bar     10     0.000000
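
To see where these numbers come from, the formula can be checked by hand on a single group. The sketch below applies it to the six Resteraunt scores from 1/24/2017, outside of any groupby; the variable names are illustrative only. It also shows a caveat that is easy to miss: a group with only one distinct score yields 0 / 0, which pandas reports as NaN.

```python
import pandas as pd

# The six Resteraunt scores for 1/24/2017, in their original row order.
scores = pd.Series([21, 27, 34, 32, 21, 12])

# Dense ranks: tied scores share a rank and no ranks are skipped.
dense = scores.rank(method="dense")  # 2, 3, 5, 4, 2, 1

# Shift to start at 0 and scale by the number of distinct scores minus 1.
pct = (dense - 1) / (scores.nunique() - 1)
print(pct.tolist())  # [0.25, 0.5, 1.0, 0.75, 0.25, 0.0]

# Caveat: a group with a single distinct score gives 0 / 0 = NaN.
flat = pd.Series([5, 5])
flat_pct = (flat.rank(method="dense") - 1) / (flat.nunique() - 1)
print(flat_pct.isna().all())  # True
```

If NaN is not wanted for such groups, appending .fillna(0) is one option, depending on how a lone score should be ranked.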

If you use rank with pct=True instead, the output is different:

df['Percnt rank'] = df.reset_index()\
                      .groupby(['index','Category'])['SCORE'].rank(method='dense', pct=True)\
                      .values
print (df)
              Category  SCORE  Percnt rank
1/24/2017  SuperMarket     12     0.250000
1/24/2017   Resteraunt     21     0.333333
1/24/2017  SuperMarket     13     0.500000
1/24/2017  SuperMarket     22     0.750000
1/24/2017   Resteraunt     27     0.500000
1/24/2017  SuperMarket     30     1.000000
1/24/2017   Resteraunt     34     0.833333
1/24/2017   Resteraunt     32     0.666667
1/24/2017   Resteraunt     21     0.333333
1/24/2017   Resteraunt     12     0.166667
1/24/2017          Bar     10     0.666667
1/24/2017          Bar      3     0.333333
1/24/2017          Bar     24     1.000000
1/25/2017   Resteraunt     32     0.333333
1/25/2017   Resteraunt     63     0.666667
1/25/2017   Resteraunt     32     0.333333
1/25/2017          Bar     12     0.666667
1/25/2017          Bar     32     1.000000
1/25/2017     Hospital     22     1.000000
1/25/2017     Hospital     12     0.500000
1/25/2017          Bar     10     0.333333
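
Since the question asks for something reasonably efficient on a large dataset, one option is to avoid the Python-level lambda and rely only on built-in groupby operations. The following is a sketch rather than a benchmarked solution: it groups on the index values directly instead of calling reset_index(), and the small frame is just the Resteraunt rows of the sample data. It computes the same quantity as the first snippet in this answer (dense rank minus 1, divided by the number of distinct scores minus 1).

```python
import pandas as pd

# Subset of the sample data: the Resteraunt rows for both dates.
df = pd.DataFrame(
    {
        "Category": ["Resteraunt"] * 9,
        "SCORE": [21, 27, 34, 32, 21, 12, 32, 63, 32],
    },
    index=["1/24/2017"] * 6 + ["1/25/2017"] * 3,
)

# Group on the index values plus the Category column; no reset_index() needed.
grouped = df.groupby([df.index, "Category"])["SCORE"]

# Both calls return one value per row, in the original row order.
# to_numpy() makes the arithmetic positional, which sidesteps any
# alignment on the duplicated date labels in the index.
dense = grouped.rank(method="dense").to_numpy()
distinct = grouped.transform("nunique").to_numpy()

df["Percnt rank"] = (dense - 1) / (distinct - 1)
print(df)
```

Whether this is noticeably faster than the lambda version depends on the number of groups, so it is worth timing both on the real data.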

Answer 1 (score: 1)

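Using a custom function, I compute rank(method='dense', pct=True) over everything except the minimum values, then fill the minimum values back in with 0.

import pandas as pd

def prank(s):
    # True for every score that is not the group minimum.
    mask = s.values != s.values.min()
    # Explicit dtype so the NaN-filled placeholder is float on all pandas versions.
    r = pd.Series(index=s.index, dtype=float)
    r.loc[mask] = s.loc[mask].rank(method='dense', pct=True)
    # The minimum values were left as NaN; they get rank 0.
    return r.fillna(0)


# transform keeps the original row order, so .values lines up with df.
df.assign(**{'Percent rank': df.reset_index().groupby(['index', 'Category']).SCORE.transform(prank).values})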

              Category  SCORE  Percent rank
1/24/2017  SuperMarket     12      0.000000
1/24/2017   Resteraunt     21      0.200000
1/24/2017  SuperMarket     13      0.333333
1/24/2017  SuperMarket     22      0.666667
1/24/2017   Resteraunt     27      0.400000
1/24/2017  SuperMarket     30      1.000000
1/24/2017   Resteraunt     34      0.800000
1/24/2017   Resteraunt     32      0.600000
1/24/2017   Resteraunt     21      0.200000
1/24/2017   Resteraunt     12      0.000000
1/24/2017          Bar     10      0.500000
1/24/2017          Bar      3      0.000000
1/24/2017          Bar     24      1.000000
1/25/2017   Resteraunt     32      0.000000
1/25/2017   Resteraunt     63      1.000000
1/25/2017   Resteraunt     32      0.000000
1/25/2017          Bar     12      0.500000
1/25/2017          Bar     32      1.000000
1/25/2017     Hospital     22      1.000000
1/25/2017     Hospital     12      0.000000
1/25/2017          Bar     10      0.000000
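
Comparing the outputs above with the sample output in the question, neither answer reproduces it exactly for groups with ties (for example, Resteraunt on 1/24/2017). The requested figures appear to match a different convention: rank with method='min', minus 1, divided by the group size minus 1. This is an inference from the numbers in the question, not something either answer states, so the sketch below should be read as one possible interpretation. The column name pctrank is taken from the question; the other variable names are illustrative.

```python
import pandas as pd

# Full sample data from the question.
dates = ["1/24/2017"] * 13 + ["1/25/2017"] * 8
categories = [
    "SuperMarket", "Resteraunt", "SuperMarket", "SuperMarket", "Resteraunt",
    "SuperMarket", "Resteraunt", "Resteraunt", "Resteraunt", "Resteraunt",
    "Bar", "Bar", "Bar",
    "Resteraunt", "Resteraunt", "Resteraunt", "Bar", "Bar",
    "Hospital", "Hospital", "Bar",
]
scores = [12, 21, 13, 22, 27, 30, 34, 32, 21, 12, 10, 3, 24,
          32, 63, 32, 12, 32, 22, 12, 10]
df = pd.DataFrame({"Category": categories, "SCORE": scores}, index=dates)

grouped = df.groupby([df.index, "Category"])["SCORE"]

# method='min': tied scores all receive the lowest rank in the tie.
min_rank = grouped.rank(method="min").to_numpy()
# Number of non-missing scores in each row's group.
size = grouped.transform("count").to_numpy()

# Assumed convention: (min rank - 1) / (group size - 1).
# A group with a single row would give 0 / 0 = NaN here.
df["pctrank"] = (min_rank - 1) / (size - 1)
print(df.round(3))
```

On the sample data this gives 0, 0.2, 0.2, 0.6, 0.8 and 1 for the Resteraunt scores 12, 21, 21, 27, 32 and 34 on 1/24/2017, which is what the question's sample output lists.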