Consider a sample DF:
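import numpy as np
import pandas as pd

# a, b and c are random integers, so the printout below is one sample draw
df = pd.DataFrame(np.random.randint(0, 60, size=(10, 3)), columns=["a", "b", "c"])
df["d1"] = ["Apple", "Mango", "Apple", "Pine", "Apple", "Mango", "Apple", "Mango", "Pine", "Apple"]
df["d2"] = ["Orange", "lemon", "lemon", "Orange", "lemon", "Orange", "lemon", "Orange", "lemon", "Orange"]
df["date"] = ["2002-01-01", "2002-01-01", "2002-01-01", "2002-01-01", "2002-02-01",
              "2002-02-01", "2002-02-01", "2002-01-01", "2002-02-01", "2002-02-01"]
df["date"] = pd.to_datetime(df["date"])  # parse the strings into datetime64 values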
df
a b c d1 d2 date
0 7 1 19 Apple Orange 2002-01-01
1 3 7 17 Mango lemon 2002-01-01
2 9 6 4 Apple lemon 2002-01-01
3 0 5 51 Pine Orange 2002-01-01
4 4 6 8 Apple lemon 2002-02-01
5 4 3 1 Mango Orange 2002-02-01
6 2 2 14 Apple lemon 2002-02-01
7 5 15 10 Mango Orange 2002-01-01
8 1 2 10 Pine lemon 2002-02-01
9 2 1 12 Apple Orange 2002-02-01
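I am trying to replace the d1 column, in an expanding manner, with a rank based on the mean of column c grouped by d1. For example, consider the first 5 rows:
First row, index 0, value Apple, is replaced with 0 by default.
Second row, index 1, value Mango, should be replaced with 0: considering only the first 2 rows of the DF, the GROUPED_MEAN of Apple is 19 and that of Mango is 17, so the Mango value at index 1 gets rank 0 because it has the lower mean.
Third row, index 2, value Apple, should be replaced with 0: considering only the first 3 rows of the DF, the GROUPED_MEAN of Apple is (19+4)/2 and that of Mango is 17, so the Apple value at index 2 gets rank 0 because it has the lower grouped mean.
Fourth row, index 3, value Pine, should be replaced with 2: considering only the first 4 rows of the DF, the GROUPED_MEAN of Apple is (19+4)/2, that of Mango is 17, and that of Pine is 51. Pine has the highest grouped mean of all 3 categories [Apple, Mango, Pine], so Pine gets rank 2.
Fifth row, index 4, value Apple, should be replaced with 0: considering only the first 5 rows of the DF, the GROUPED_MEAN of Apple is (19+4+8)/3, that of Mango is 17, and that of Pine is 51. Apple has the lowest grouped mean of all 3 categories [Apple, Mango, Pine], so Apple gets rank 0.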
Expected values of the d1 column for the first 5 rows: 0, 0, 0, 2, 0
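As a quick check of one step of the walkthrough, the snippet below recomputes the grouped means for the first four rows only. It is a minimal sketch that hard-codes the c and d1 values from the printed sample; the variable names head, grouped_mean and rank are chosen here for illustration.

```python
import pandas as pd

# First four rows of the printed sample (only the columns that matter here).
head = pd.DataFrame({
    "c": [19, 17, 4, 51],
    "d1": ["Apple", "Mango", "Apple", "Pine"],
})

# Mean of c per d1 category over these four rows.
grouped_mean = head.groupby("d1")["c"].mean()

# Dense rank of the means, shifted so the lowest mean gets rank 0.
rank = grouped_mean.rank(method="dense").sub(1).astype(int)

print(grouped_mean.to_dict())  # {'Apple': 11.5, 'Mango': 17.0, 'Pine': 51.0}
print(rank.to_dict())          # {'Apple': 0, 'Mango': 1, 'Pine': 2}
```

The row at index 3 is Pine, so it takes rank 2, as described for the fourth row.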
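Iterative approach: I was able to do this by iterating over every row of the DF, but performance is poor on large DFs, so any suggestions for a more pandas-based approach would be great.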
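For reference, a row-by-row baseline might look roughly like the sketch below. The asker's own code is not shown here, so the function name expanding_rank_loop, its parameters, and the hard-coded sample values (taken from the printed table) are all assumptions made for illustration.

```python
import pandas as pd

def expanding_rank_loop(df, group_col="d1", value_col="c"):
    """Hypothetical row-by-row baseline, not the asker's original code."""
    ranks = []
    for i in range(len(df)):
        window = df.iloc[: i + 1]                         # rows 0..i
        means = window.groupby(group_col)[value_col].mean()
        dense = means.rank(method="dense") - 1            # lowest mean -> 0
        ranks.append(int(dense[df[group_col].iat[i]]))    # rank of row i's category
    return pd.Series(ranks, index=df.index)

# c and d1 values copied from the printed sample.
sample = pd.DataFrame({
    "c": [19, 17, 4, 51, 8, 1, 14, 10, 10, 12],
    "d1": ["Apple", "Mango", "Apple", "Pine", "Apple",
           "Mango", "Apple", "Mango", "Pine", "Apple"],
})

print(expanding_rank_loop(sample).tolist())  # [0, 0, 0, 2, 0, 0, 1, 0, 2, 1]
```

Each iteration regroups the whole prefix, so the work grows roughly quadratically with the number of rows, which is consistent with the poor performance described.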
Answer 0 (score: 1)
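Use Series.expanding on column c with a minimum window size of 1, and apply a custom lambda function exp. Inside this lambda function we group the expanding window w by column d1 of the original dataframe, transform using mean, and finally use Series.rank with method='dense' to calculate the rank. The .iat[-1] picks the rank of the last row in the window, and .sub(1) shifts the ranks so they start at 0: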
exp = lambda w: w.groupby(df['d1']).transform('mean').rank(method='dense').iat[-1]
df['d1_new'] = df['c'].expanding(1).apply(exp).sub(1).astype(int)
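To see what the lambda computes for a single window, the sketch below unrolls it for the window that ends at index 3. The values are hard-coded from the printed sample, and the names c, d1, per_row_mean, dense_rank and last_rank are illustrative only.

```python
import pandas as pd

# Window w when the expanding window ends at index 3, with its d1 labels.
c = pd.Series([19, 17, 4, 51], dtype=float)
d1 = pd.Series(["Apple", "Mango", "Apple", "Pine"])

# transform('mean') gives every row the mean of its own group.
per_row_mean = c.groupby(d1).transform("mean")

# Dense rank over those per-row means (1-based, equal means share a rank).
dense_rank = per_row_mean.rank(method="dense")

# The lambda returns the rank of the last row; sub(1) makes it 0-based.
last_rank = int(dense_rank.iat[-1]) - 1

print(per_row_mean.tolist())  # [11.5, 17.0, 11.5, 51.0]
print(dense_rank.tolist())    # [1.0, 2.0, 1.0, 3.0]
print(last_rank)              # 2
```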
Result:
# print(df)
a b c d1 d2 date d1_new
0 7 1 19 Apple Orange 2002-01-01 0
1 3 7 17 Mango lemon 2002-01-01 0
2 9 6 4 Apple lemon 2002-01-01 0
3 0 5 51 Pine Orange 2002-01-01 2
4 4 6 8 Apple lemon 2002-02-01 0
5 4 3 1 Mango Orange 2002-02-01 0
6 2 2 14 Apple lemon 2002-02-01 1
7 5 15 10 Mango Orange 2002-01-01 0
8 1 2 10 Pine lemon 2002-02-01 2
9 2 1 12 Apple Orange 2002-02-01 1
Performance:
df.shape
(1000, 7)
%%timeit
exp = lambda w: w.groupby(df['d1']).transform('mean').rank(method='dense').iat[-1]
df['d1_new'] = df['c'].expanding(1).apply(exp).sub(1).astype(int)
3.15 s ± 305 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
expanding(df,["d1"]) # your method
11.9 s ± 449 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
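The expanding().apply version still calls a Python function once per row. One possible alternative, which is not part of the answer above and has not been benchmarked here, is to build cumulative per-group sums and counts with NumPy and rank within each row. The sketch below assumes d1 has no missing values and that an array of n rows by k categories fits in memory; the function name expanding_group_rank is hypothetical.

```python
import numpy as np
import pandas as pd

def expanding_group_rank(values, groups):
    """Hypothetical helper: 0-based dense rank of each row's expanding group mean."""
    groups = pd.Series(groups)
    codes, uniques = pd.factorize(groups)       # integer code per category
    n, k = len(codes), len(uniques)
    rows = np.arange(n)

    onehot = np.zeros((n, k))
    onehot[rows, codes] = 1.0                   # 1 where row i belongs to group j

    counts = onehot.cumsum(axis=0)              # expanding count per group
    sums = (onehot * np.asarray(values, dtype=float)[:, None]).cumsum(axis=0)

    with np.errstate(invalid="ignore", divide="ignore"):
        means = sums / counts                   # NaN for groups not seen yet (0/0)
        current = means[rows, codes]            # expanding mean of each row's own group
        ordered = np.sort(means, axis=1)        # NaNs are sorted to the end
        first_of_run = np.ones((n, k), dtype=bool)
        first_of_run[:, 1:] = ordered[:, 1:] != ordered[:, :-1]
        below = ordered < current[:, None]      # comparisons with NaN are False

    # Dense rank = number of distinct group means strictly below the current one.
    return pd.Series((first_of_run & below).sum(axis=1), index=groups.index)

# c and d1 values copied from the printed sample.
sample = pd.DataFrame({
    "c": [19, 17, 4, 51, 8, 1, 14, 10, 10, 12],
    "d1": ["Apple", "Mango", "Apple", "Pine", "Apple",
           "Mango", "Apple", "Mango", "Pine", "Apple"],
})

print(expanding_group_rank(sample["c"], sample["d1"]).tolist())
# [0, 0, 0, 2, 0, 0, 1, 0, 2, 1]
```

Because the means come from cumulative sums, two groups whose means are mathematically equal could differ by floating-point rounding and be ranked apart, so it is worth comparing the output with the expanding().apply result on real data before relying on it.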