Expanding rank of a pandas column

Time: 2020-07-11 03:27:03

Tags: python pandas dataframe pandas-groupby

Consider a sample DataFrame:

import numpy as np
import pandas as pd

# a, b, c are random; d1 and date follow the 10-row table printed below
df = pd.DataFrame(np.random.randint(0,60,size=(10,3)),columns=["a","b","c"])
df["d1"]=["Apple","Mango","Apple","Pine","Apple","Mango","Apple","Mango","Pine","Apple"]
df["d2"]=["Orange","lemon","lemon","Orange","lemon","Orange","lemon","Orange","lemon","Orange"]
df["date"] = ["2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-02-01","2002-02-01","2002-02-01","2002-01-01","2002-02-01","2002-02-01"]
df["date"] = pd.to_datetime(df["date"])
df

    a   b   c     d1       d2    date
0   7   1   19  Apple   Orange  2002-01-01
1   3   7   17  Mango   lemon   2002-01-01
2   9   6   4   Apple   lemon   2002-01-01
3   0   5   51  Pine    Orange  2002-01-01
4   4   6   8   Apple   lemon   2002-02-01
5   4   3   1   Mango   Orange  2002-02-01
6   2   2   14  Apple   lemon   2002-02-01
7   5   15  10  Mango   Orange  2002-01-01
8   1   2   10  Pine    lemon   2002-02-01
9   2   1   12  Apple   Orange  2002-02-01

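I am trying to replace column d1 with a rank based on the mean of column c grouped by d1, computed in an expanding fashion. For instance, consider the first 5 rows: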

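  1. First row, index 0: the value Apple is replaced by 0 by default.

  2. Second row, index 1: the value Mango should be replaced by 0. Considering only the first 2 rows of the DF, the GROUPED_MEAN of Apple is 19 and that of Mango is 17, so Mango at index 1 gets rank 0 because it has the lower grouped mean.

  3. Third row, index 2: the value Apple should be replaced by 0. Considering only the first 3 rows of the DF, the GROUPED_MEAN of Apple is (19+4)/2 and that of Mango is 17, so Apple at index 2 gets rank 0 because it has the lower grouped mean.

  4. Fourth row, index 3: the value Pine should be replaced by 2. Considering only the first 4 rows of the DF, the GROUPED_MEAN of Apple is (19+4)/2, that of Mango is 17, and that of Pine is 51. Pine has the highest grouped mean among all 3 categories [Apple, Mango, Pine], so Pine gets rank 2.

  5. Fifth row, index 4: the value Apple should be replaced by 0. Considering only the first 5 rows of the DF, the GROUPED_MEAN of Apple is (19+4+8)/3, that of Mango is 17, and that of Pine is 51. Apple has the lowest grouped mean among all 3 categories, so Apple gets rank 0.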
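
The grouped means quoted in the steps above can be checked with a small sketch. It assumes c and d1 are hard-coded from the sample table (the random a and b columns are left out); prefix_means is a name introduced here only for illustration.

```python
import pandas as pd

# c and d1 copied from the sample table above (assumption: the table is the reference data)
df = pd.DataFrame({
    "c": [19, 17, 4, 51, 8, 1, 14, 10, 10, 12],
    "d1": ["Apple", "Mango", "Apple", "Pine", "Apple",
           "Mango", "Apple", "Mango", "Pine", "Apple"],
})

# For each of the first 5 rows, group only the rows seen so far
prefix_means = []
for k in range(1, 6):
    means = df.iloc[:k].groupby("d1")["c"].mean()
    prefix_means.append(means)
    print(f"rows 0..{k - 1}: {means.round(2).to_dict()}")
```

Ranking the printed means from lowest to highest, starting at 0, and reading off the rank of the last row's category gives the replacement value described in each step.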

Expected values of column d1:

[0, 0, 0, 2, 0, 0, 1, 0, 2, 1]

Iterative approach:

[code of the expanding(df, ["d1"]) function not preserved]

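I was able to do this by iterating over every row of the DF, but performance is poor for a large DF, so any suggestions for a more pandas-based approach would be great.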
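
The asker's own function is not reproduced here. As a rough illustration only, a row-by-row baseline might look like the sketch below. expanding_rank_loop and its parameters are hypothetical names, not the original code, and c and d1 are hard-coded from the sample table.

```python
import pandas as pd

def expanding_rank_loop(df, group_col="d1", value_col="c"):
    """Hypothetical baseline: recompute the grouped means for every prefix."""
    ranks = []
    for k in range(1, len(df) + 1):
        head = df.iloc[:k]
        # Mean of value_col per group, using only the first k rows
        means = head.groupby(group_col)[value_col].mean()
        # Dense rank of each group's mean, shifted so the lowest mean gets 0
        dense = means.rank(method="dense").astype(int) - 1
        # Keep the rank of the group that the k-th row belongs to
        ranks.append(int(dense[head[group_col].iloc[-1]]))
    return ranks

df = pd.DataFrame({
    "c": [19, 17, 4, 51, 8, 1, 14, 10, 10, 12],
    "d1": ["Apple", "Mango", "Apple", "Pine", "Apple",
           "Mango", "Apple", "Mango", "Pine", "Apple"],
})
ranks = expanding_rank_loop(df)
print(ranks)  # [0, 0, 0, 2, 0, 0, 1, 0, 2, 1]
```

Every iteration runs a fresh groupby over a growing slice, which is why a loop of this kind slows down sharply on a large DF.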

1 Answer:

Answer 0: (score: 1)

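Use Series.expanding on column c with a minimum window size of 1 and apply a custom lambda function exp. Inside this lambda, the expanding window w is grouped by column d1 of the original dataframe and transformed with mean, and Series.rank with method='dense' then ranks the resulting group means. Taking .iat[-1] keeps the rank of the window's last row, and .sub(1) shifts the 1-based ranks so that they start at 0: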

exp = lambda w: w.groupby(df['d1']).transform('mean').rank(method='dense').iat[-1]
df['d1_new'] = df['c'].expanding(1).apply(exp).sub(1).astype(int)

Result:

# print(df)

   a   b   c     d1      d2        date  d1_new
0  7   1  19  Apple  Orange  2002-01-01       0
1  3   7  17  Mango   lemon  2002-01-01       0
2  9   6   4  Apple   lemon  2002-01-01       0
3  0   5  51   Pine  Orange  2002-01-01       2
4  4   6   8  Apple   lemon  2002-02-01       0
5  4   3   1  Mango  Orange  2002-02-01       0
6  2   2  14  Apple   lemon  2002-02-01       1
7  5  15  10  Mango  Orange  2002-01-01       0
8  1   2  10   Pine   lemon  2002-02-01       2
9  2   1  12  Apple  Orange  2002-02-01       1

Performance:

df.shape
(1000, 7)

%%timeit
exp = lambda w: w.groupby(df['d1']).transform('mean').rank(method='dense').iat[-1]
df['d1_new'] = df['c'].expanding(1).apply(exp).sub(1).astype(int)
3.15 s ± 305 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
expanding(df,["d1"]) # your method
11.9 s ± 449 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
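
If the per-window groupby is still too slow, one possible alternative is to keep a running sum and count per group and derive the rank from those. This is not part of the answer above and has not been benchmarked here. expanding_rank_running is a hypothetical name, and the sketch assumes c and d1 are passed as plain sequences (hard-coded below from the sample table).

```python
from collections import defaultdict

def expanding_rank_running(values, groups):
    """Hypothetical alternative: running sum and count per group, no groupby."""
    total = defaultdict(float)
    count = defaultdict(int)
    ranks = []
    for value, group in zip(values, groups):
        total[group] += value
        count[group] += 1
        current = total[group] / count[group]
        # 0-based dense rank = number of distinct group means below the current one
        lower = {total[g] / count[g] for g in total if total[g] / count[g] < current}
        ranks.append(len(lower))
    return ranks

c = [19, 17, 4, 51, 8, 1, 14, 10, 10, 12]
d1 = ["Apple", "Mango", "Apple", "Pine", "Apple",
      "Mango", "Apple", "Mango", "Pine", "Apple"]
print(expanding_rank_running(c, d1))  # [0, 0, 0, 2, 0, 0, 1, 0, 2, 1]
```

Each row then costs one pass over the groups seen so far instead of a groupby over the whole window. Whether that is faster in practice depends on the number of distinct groups and should be measured on the real data.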