My data has 3 columns: person_id, group_id, and score. Each person can have multiple records across different group_ids. I want to apply a function to each person's scores per group_id. For example, apply some aggregation function for person 0 and group_id 0, apply the same function for person 1 and group_id 0, but a different function for group_id 1.

I know how to do this with a for loop, but it is very inefficient on large datasets. Any ideas on how to do it with groupby?

Here is some code:
import numpy as np
import pandas as pd

n = 100
person_id = np.random.randint(0, 10, size=n)
group_id = np.random.randint(0, 3, size=n)
score = np.random.rand(n)
df = pd.DataFrame([person_id, group_id, score]).T
df.columns = ['PERSON_ID', 'GROUP_ID', 'SCORE']

score_summary = []
for person in df['PERSON_ID'].unique():
    idx0 = (df['PERSON_ID'].values == person) & (df['GROUP_ID'].values == 0)
    score0 = np.mean(5.0 * df.loc[idx0, 'SCORE'] + 2)
    idx1 = (df['PERSON_ID'].values == person) & (df['GROUP_ID'].values == 1)
    score1 = np.mean(6.0 * df.loc[idx1, 'SCORE'] + 2)
    idx2 = (df['PERSON_ID'].values == person) & (df['GROUP_ID'].values == 2)
    score2 = np.mean(5.0 * df.loc[idx2, 'SCORE'] + 3)
    score_summary.append({'PERSON_ID': person,
                          'SCORE0': score0,
                          'SCORE1': score1,
                          'SCORE2': score2})
df_summary = pd.DataFrame(score_summary)
df_summary.head()
Edit:
I found another approach that seems to run orders of magnitude faster on large datasets:
df['NEW_SCORE'] = np.nan
df.loc[df['GROUP_ID'] == 0, 'NEW_SCORE'] = 5.0 * df.loc[df['GROUP_ID'] == 0, 'SCORE'] + 2
df.loc[df['GROUP_ID'] == 1, 'NEW_SCORE'] = 6.0 * df.loc[df['GROUP_ID'] == 1, 'SCORE'] + 2
df.loc[df['GROUP_ID'] == 2, 'NEW_SCORE'] = 5.0 * df.loc[df['GROUP_ID'] == 2, 'SCORE'] + 3
df1 = df.groupby(['PERSON_ID', 'GROUP_ID']).mean()
df_summary2 = df1.reset_index().pivot(index='PERSON_ID', columns='GROUP_ID', values='NEW_SCORE')
Answer 0 (score: 0)

The design problem with groupby here is that only a single series gets passed to the function. When grouping by PERSON_ID, you need some hack to check the GROUP_ID values, and that appears to be expensive. So I find your own solution efficient. However, if you really want to trim the code and performance is not a concern, you can use df.groupby.apply as below.
n = 100
person_id = np.random.randint(0, 10, size=n)
group_id = np.random.randint(0, 3, size=n)
score = np.random.rand(n)
df = pd.DataFrame([person_id, group_id, score]).T
df.columns = ['PERSON_ID', 'GROUP_ID', 'SCORE']
df = pd.concat([df] * 1000)

def original(df):
    score_summary = []
    for person in df['PERSON_ID'].unique():
        idx0 = (df['PERSON_ID'].values == person) & (df['GROUP_ID'].values == 0)
        score0 = np.mean(5.0 * df.loc[idx0, 'SCORE'] + 2)
        idx1 = (df['PERSON_ID'].values == person) & (df['GROUP_ID'].values == 1)
        score1 = np.mean(6.0 * df.loc[idx1, 'SCORE'] + 2)
        idx2 = (df['PERSON_ID'].values == person) & (df['GROUP_ID'].values == 2)
        score2 = np.mean(5.0 * df.loc[idx2, 'SCORE'] + 3)
        score_summary.append({'PERSON_ID': person,
                              'SCORE0': score0,
                              'SCORE1': score1,
                              'SCORE2': score2})
    df_summary = pd.DataFrame(score_summary)
    return df_summary

def jp(df):
    idx = {k: set(df[df['GROUP_ID'] == k].index) for k in (0, 1, 2)}
    d = {0: (5, 2), 1: (6, 2), 2: (5, 3)}
    def func(x):
        return tuple(d[k][0] * np.mean(x[x.index.isin(idx[k])]) + d[k][1]
                     for k in (0, 1, 2))
    return df.groupby(['PERSON_ID'])['SCORE'].apply(func).reset_index()

%timeit original(df)  # 39.7 ms
%timeit jp(df)        # 78 ms
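The index-set hack above exists only because the answer groups by PERSON_ID alone. If one aggregated value per (person, group) pair is acceptable as output, grouping by both keys lets each group dispatch to its own function via the group's `.name` (a `(PERSON_ID, GROUP_ID)` tuple during `apply`). A sketch, not from the thread; the `funcs` dict name is my own:

```python
import numpy as np
import pandas as pd

n = 100
rng = np.random.default_rng(1)
df = pd.DataFrame({
    'PERSON_ID': rng.integers(0, 10, size=n),
    'GROUP_ID': rng.integers(0, 3, size=n),
    'SCORE': rng.random(n),
})

# One aggregation function per GROUP_ID value.
funcs = {
    0: lambda s: np.mean(5.0 * s + 2),
    1: lambda s: np.mean(6.0 * s + 2),
    2: lambda s: np.mean(5.0 * s + 3),
}

# During apply, each group Series carries its key tuple in .name;
# s.name[1] is the GROUP_ID, which selects the right function.
agg = (df.groupby(['PERSON_ID', 'GROUP_ID'])['SCORE']
         .apply(lambda s: funcs[s.name[1]](s)))

# Optional: one row per person, one column per group.
wide = agg.unstack('GROUP_ID')
```

This stays a Python-level `apply`, so it will not beat a fully vectorized transform-then-mean, but it removes the need to precompute index sets.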
Answer 1 (score: 0)

Consider creating separate user-defined functions and passing them into groupby().agg():
def s1(x):
    return np.mean(5.0 * x + 2)

def s2(x):
    return np.mean(6.0 * x + 2)

def s3(x):
    return np.mean(5.0 * x + 3)

df_summary = df.groupby(['PERSON_ID', 'GROUP_ID']).agg([s1, s2, s3])
df_summary.head()  # USE np.random.seed(222) TO REPRODUCE
#                          SCORE
#                             s1        s2        s3
# PERSON_ID GROUP_ID
# 0.0       0.0       3.209123  3.450948  4.209123
#           2.0       5.295679  5.954815  6.295679
# 1.0       0.0       5.012666  5.615199  6.012666
#           2.0       4.621171  5.145406  5.621171
# 2.0       0.0       3.926392  4.311670  4.926392
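Note that `agg([s1, s2, s3])` applies all three functions to every (person, group) block, while the question only wants the one matching each GROUP_ID (s1 for group 0, s2 for group 1, s3 for group 2). A sketch of selecting the matching column afterwards, assuming integer GROUP_ID values so the group id doubles as the column position (the `picked` name is my own):

```python
import numpy as np
import pandas as pd

def s1(x):
    return np.mean(5.0 * x + 2)

def s2(x):
    return np.mean(6.0 * x + 2)

def s3(x):
    return np.mean(5.0 * x + 3)

n = 100
rng = np.random.default_rng(2)
df = pd.DataFrame({
    'PERSON_ID': rng.integers(0, 10, size=n),
    'GROUP_ID': rng.integers(0, 3, size=n),
    'SCORE': rng.random(n),
})

# Columns come out in the order [s1, s2, s3], matching GROUP_ID 0, 1, 2.
df_summary = df.groupby(['PERSON_ID', 'GROUP_ID'])['SCORE'].agg([s1, s2, s3])

# For each row, keep only the aggregate whose column index equals the GROUP_ID.
gid = df_summary.index.get_level_values('GROUP_ID').to_numpy().astype(int)
picked = pd.Series(
    df_summary.to_numpy()[np.arange(len(df_summary)), gid],
    index=df_summary.index, name='SCORE')
```

The cost is that every function still runs on every block; for three cheap linear transforms that overhead is small.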
Answer 2 (score: 0)

I found a time-efficient way to do the task. It is still not the most efficient code, but on larger datasets it runs much faster than the for-loop solution:
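The code for this answer appears to have been lost in extraction; it is presumably the masked-assignment approach already shown in the question's edit (transform each group's rows with a boolean mask, then aggregate once). A sketch under that assumption, with the coefficient dict as my own framing:

```python
import numpy as np
import pandas as pd

n = 100
rng = np.random.default_rng(3)
df = pd.DataFrame({
    'PERSON_ID': rng.integers(0, 10, size=n),
    'GROUP_ID': rng.integers(0, 3, size=n),
    'SCORE': rng.random(n),
})

# Apply each group's linear transform a*score + b through a boolean mask,
# then do a single grouped mean instead of a Python loop over persons.
df['NEW_SCORE'] = np.nan
for g, (a, b) in {0: (5.0, 2.0), 1: (6.0, 2.0), 2: (5.0, 3.0)}.items():
    mask = df['GROUP_ID'] == g
    df.loc[mask, 'NEW_SCORE'] = a * df.loc[mask, 'SCORE'] + b

df_summary = (df.groupby(['PERSON_ID', 'GROUP_ID'])['NEW_SCORE']
                .mean()
                .unstack('GROUP_ID'))
```

The loop here runs once per group (three iterations), not once per person, which is why it scales so much better than the original for loop.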