Calculating nunique() for a groupby in pandas

Date: 2018-03-15 10:49:38

Tags: python pandas pandas-groupby

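I have a dataframe with the following columns:

  1. diff - the difference between the registration date and the payment date, in days
  2. country - the user's country
  3. user_id
  4. campaign_id - another categorical column that we will use in the groupby

I need to count the number of distinct users for each country + campaign_id with diff <= n. For example, for country 'A', campaign 'abc' and diff 7, I need to count the distinct users from country 'A' and campaign 'abc' whose diff is <= 7.

My current solution (below) takes too long to run: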

    import pandas as pd
    import numpy as np
    
    ## generate test dataframe
    df = pd.DataFrame({
            'country':np.random.choice(['A', 'B', 'C', 'D'], 10000),
            'campaign': np.random.choice(['camp1', 'camp2', 'camp3', 'camp4', 'camp5', 'camp6'], 10000),
            'diff':np.random.choice(range(10), 10000),
            'user_id': np.random.choice(range(1000), 10000)
            })
    ## main
    result_df = pd.DataFrame()
    for diff in df['diff'].unique():
        tmp_df = df.loc[df['diff']<=diff,:]
        tmp_df = tmp_df.groupby(['country', 'campaign'], as_index=False).apply(lambda x: x.user_id.nunique()).reset_index()
        tmp_df['diff'] = diff
        tmp_df.columns=['country', 'campaign', 'unique_ppl', 'diff']
        result_df = pd.concat([result_df, tmp_df],ignore_index=True, axis=0)
    

Maybe there is a better way to do this?

2 answers:

Answer 0 (score: 3)

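First use a list comprehension with concat and assign to stack, for every threshold x, the rows with diff <= x, relabelling their diff column as x. Then use groupby with nunique on user_id, rename the resulting column, and, if necessary, add reindex for a custom column order: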

    df1 = pd.concat([df.loc[df['diff']<=x].assign(diff=x) for x in  df['diff'].unique()])
    df2 = (df1.groupby(['diff','country', 'campaign'], sort=False)['user_id']
              .nunique()
              .reset_index()
              .rename(columns={'user_id':'unique_ppl'})
              .reindex(columns=['country', 'campaign', 'unique_ppl', 'diff']))
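
To see what the concat/assign step does, here is a minimal sketch on a four-row frame. The `toy` data and the names `expanded` and `counts` are made up for illustration and are not part of the original answer; only the technique is the same.

```python
import pandas as pd

# Hypothetical toy data: one country/campaign pair, three distinct users.
toy = pd.DataFrame({
    'country':  ['A', 'A', 'A', 'A'],
    'campaign': ['camp1', 'camp1', 'camp1', 'camp1'],
    'diff':     [0, 1, 1, 2],
    'user_id':  [10, 10, 11, 12],
})

# For each threshold x, keep the rows with diff <= x and relabel them with x.
# A row with diff 0 therefore appears once for every threshold >= 0.
expanded = pd.concat(
    [toy.loc[toy['diff'] <= x].assign(diff=x) for x in toy['diff'].unique()]
)

# A plain groupby on the relabelled diff now yields the cumulative distinct counts.
counts = (expanded.groupby(['diff', 'country', 'campaign'], sort=False)['user_id']
                  .nunique()
                  .reset_index()
                  .rename(columns={'user_id': 'unique_ppl'}))

print(counts)
# unique_ppl per diff: 0 -> 1, 1 -> 2, 2 -> 3
```

The price of this approach is memory: `expanded` holds one copy of each row per threshold that covers it (8 rows here, from 4 originals), so it trades space for a single vectorized groupby.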

Answer 1 (score: 1)

An alternative is shown below, but @jezrael's solution is the best option.

Performance benchmarks

    %timeit original(df)  # 149ms
    %timeit jp(df)        # 81ms
    %timeit jez(df)       # 47ms

    def original(df):
        result_df = pd.DataFrame()
        for diff in df['diff'].unique():
            tmp_df = df.loc[df['diff']<=diff,:]
            tmp_df = tmp_df.groupby(['country', 'campaign'], as_index=False).apply(lambda x: x.user_id.nunique()).reset_index()
            tmp_df['diff'] = diff
            tmp_df.columns=['country', 'campaign', 'unique_ppl', 'diff']
            result_df = pd.concat([result_df, tmp_df],ignore_index=True, axis=0)

        return result_df

    def jp(df):
        # Collect one aggregated frame per threshold, then concatenate once at the end.
        lst = []
        lst_append = lst.append
        for diff in df['diff'].unique():
            tmp_df = df.loc[df['diff']<=diff,:]
            tmp_df = tmp_df.groupby(['country', 'campaign'], as_index=False).agg({'user_id': 'nunique'})
            tmp_df['diff'] = diff
            tmp_df.columns=['country', 'campaign', 'unique_ppl', 'diff']
            lst_append(tmp_df)

        # DataFrame.append was removed in pandas 2.0; a single pd.concat gives the same result.
        return pd.concat(lst, ignore_index=True, axis=0)

    def jez(df):
        df1 = pd.concat([df.loc[df['diff']<=x].assign(diff=x) for x in  df['diff'].unique()])
        df2 = (df1.groupby(['diff','country', 'campaign'], sort=False)['user_id']
                  .nunique()
                  .reset_index()
                  .rename(columns={'user_id':'unique_ppl'})
                  .reindex(columns=['country', 'campaign', 'unique_ppl', 'diff']))
        return df2
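
Before comparing timings, it can be worth checking that the variants actually return the same counts. The sketch below is one way to do that, not part of the original answer: `loop_counts`, `concat_counts` and `normalize` are hypothetical names that restate the loop-based and concat-based approaches compactly, and the seed and data sizes are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed (an assumption) so the check is repeatable
n = 2000
df = pd.DataFrame({
    'country': rng.choice(['A', 'B', 'C', 'D'], n),
    'campaign': rng.choice(['camp1', 'camp2', 'camp3'], n),
    'diff': rng.integers(0, 10, n),
    'user_id': rng.integers(0, 200, n),
})

def loop_counts(df):
    # Loop-based variant: one filtered groupby per threshold, concatenated once.
    parts = []
    for d in df['diff'].unique():
        part = (df.loc[df['diff'] <= d]
                  .groupby(['country', 'campaign'], as_index=False)
                  .agg({'user_id': 'nunique'})
                  .rename(columns={'user_id': 'unique_ppl'}))
        part['diff'] = d
        parts.append(part)
    return pd.concat(parts, ignore_index=True)

def concat_counts(df):
    # Concat-based variant: expand the rows per threshold, then a single groupby.
    expanded = pd.concat(
        [df.loc[df['diff'] <= x].assign(diff=x) for x in df['diff'].unique()]
    )
    return (expanded.groupby(['diff', 'country', 'campaign'], sort=False)['user_id']
                    .nunique()
                    .reset_index()
                    .rename(columns={'user_id': 'unique_ppl'}))

def normalize(res):
    # Same column order and row order, so the frames can be compared directly.
    cols = ['country', 'campaign', 'diff', 'unique_ppl']
    return res[cols].sort_values(cols[:3]).reset_index(drop=True)

a = normalize(loop_counts(df))
b = normalize(concat_counts(df))

# Raises AssertionError if the two approaches disagree anywhere.
pd.testing.assert_frame_equal(a, b, check_dtype=False)
print(len(a), "rows match")
```

Row order differs between the two approaches, which is why both results are sorted by country, campaign and diff before the comparison.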