Trying to speed up this GroupBy and looking for ideas on faster code to replace it.
The goal is to create a "Standardized Name" column containing the most frequently occurring "Company Name" for each "Location ID". Any ideas for achieving the same result more efficiently, the pandas way?
Here is my starting DataFrame:
Company Name Location ID
0 jones LLC F55555JONE
1 jones LLC F55555JONE
2 jones F55555JONE
3 alpha Co F11111ALPH
4 alpha Co F11111ALPH
5 alpha F11111ALPH
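For reproducibility, a minimal snippet (not part of the original post) that builds this sample frame:

import pandas as pd

# Sample data as shown above
df = pd.DataFrame({
    "Company Name": ["jones LLC", "jones LLC", "jones", "alpha Co", "alpha Co", "alpha"],
    "Location ID": ["F55555JONE"] * 3 + ["F11111ALPH"] * 3,
})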
Here are the two working versions, timed with timeit:
df.groupby(["Location ID"])["Company Name"].agg(lambda x: Counter(x).most_common(1)[0][0]).reset_index()
13.2 ms ± 358 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df.groupby(["Location ID"])["Company Name"].apply(lambda x: x.value_counts().index[0]).reset_index()
# 5.22 ms ± 75.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Output:
Location ID Company Name
0 F11111ALPH alpha Co
1 F55555JONE jones LLC
Dropping Counter gives roughly a 2x speedup, but I'm running this on 100k rows. Does the GroupBy itself need to go? Thanks.
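For context, if the end goal is the per-row "Standardized Name" column rather than a reduced frame, one hedged sketch (reusing the value_counts idea, not from the original post) uses groupby().transform:

df["Standardized Name"] = (
    df.groupby("Location ID")["Company Name"]
      .transform(lambda x: x.value_counts().index[0])
)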
Answer 0 (score: 0)
Can we try mode:
df.groupby(["Location ID"],as_index=False)[["Company Name"]].agg(lambda x: x.mode().iloc[0])
  Location ID Company Name
0  F11111ALPH     alpha Co
1  F55555JONE    jones LLC
Answer 1 (score: 0)
I do think apply is slowing things down. Try this alternative, which computes group sizes, sorts, and then uses an efficient drop_duplicates to take the mode. In the case of ties, the "modal" value will be the one that appears first in the DataFrame.
gp_cols = ['Location ID']
value_col = 'Company Name'

(df.groupby(gp_cols + [value_col], observed=True, sort=False).size()
   .to_frame('counts').reset_index()
   .sort_values('counts', ascending=False)
   .drop_duplicates(subset=gp_cols)
   .drop(columns='counts'))

#  Location ID Company Name
#0  F55555JONE    jones LLC
#2  F11111ALPH     alpha Co
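If you then need the value mapped back onto every row (the "Standardized Name" column the question asks for), a hedged sketch, assuming the grouped result above is saved as modes (the same logic is wrapped as fast_mode in the timing code below):

modes = fast_mode(df)
df['Standardized Name'] = df['Location ID'].map(
    modes.set_index('Location ID')['Company Name']
)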
Some timings:
import perfplot
import pandas as pd
import numpy as np

def fast_mode(df):
    gp_cols = ['Location ID']
    value_col = 'Company Name'
    return (df.groupby(gp_cols + [value_col], observed=True, sort=False).size()
              .to_frame('counts').reset_index()
              .sort_values('counts', ascending=False)
              .drop_duplicates(subset=gp_cols)
              .drop(columns='counts'))

def apply_value_counts(df):
    return (df.groupby(['Location ID'])['Company Name']
              .apply(lambda x: x.value_counts().index[0]).reset_index())

perfplot.show(
    setup=lambda n: pd.DataFrame({'Location ID': np.random.randint(0, n//50+1, n),
                                  'Company Name': np.random.randint(0, n//500+1, n)}),
    kernels=[
        lambda df: fast_mode(df),
        lambda df: apply_value_counts(df),
    ],
    labels=['Fast Mode', 'Apply Value Counts'],
    n_range=[2 ** k for k in range(2, 24)],
    equality_check=None,  # results may differ on ties, and in row order
    xlabel='~len(df)'
)