Combine rows in a DataFrame and add values as columns

Time: 2018-01-03 23:10:38

Tags: python pandas dataframe pandas-groupby

My DataFrame looks like this:

campaign_name  campaign_id    event_name  clicks  installs  conversions
   campaign_1         1234  registration     100         5            1
   campaign_1         1234    hv_users_r     100         5            2
   campaign_2         2345  registration     500        10            3
   campaign_2         2345    hv_users_w     500        10            2
   campaign_3         3456  registration    1000        50           10
   campaign_4         3456    hv_users_r    1000        50           15
   campaign_4         3456    hv_users_w    1000        50           25

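I want to categorize all the event names into 2 new columns. The first new column represents "registrations"; the second represents "hv_users", which is the sum over all rows whose event name is "hv_users_r" or "hv_users_w".

To keep it simple: the "registrations" column contains only the rows whose event_name is "registration". All non-"registration" event_names go into the new "hv_users" column.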
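The rule above can be sketched as a small labelling step. This is only an illustration on the sample data from the question; the `event_group` column name is made up for the example and is not part of the original data.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'campaign_name': ['campaign_1', 'campaign_1', 'campaign_2', 'campaign_2',
                      'campaign_3', 'campaign_4', 'campaign_4'],
    'campaign_id': [1234, 1234, 2345, 2345, 3456, 3456, 3456],
    'event_name': ['registration', 'hv_users_r', 'registration', 'hv_users_w',
                   'registration', 'hv_users_r', 'hv_users_w'],
    'clicks': [100, 100, 500, 500, 1000, 1000, 1000],
    'installs': [5, 5, 10, 10, 50, 50, 50],
    'conversions': [1, 2, 3, 2, 10, 15, 25],
})

# 'registration' maps to 'registrations'; every other event_name maps to 'hv_users'
# (event_group is a hypothetical helper column used only for this sketch)
is_registration = df['event_name'].eq('registration')
df['event_group'] = is_registration.map({True: 'registrations', False: 'hv_users'})

print(df[['event_name', 'event_group', 'conversions']])
```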

Here is my expected new DataFrame:

campaign_name  campaign_id  clicks installs  registrations  hv_users
   campaign_1         1234     100        5              1         2
   campaign_2         2345     500       10              3         2
   campaign_3         3456    1000       50             10        40  

Can someone show me how to get from the input DataFrame to the output DataFrame?

4 answers:

Answer 0: (score: 1)

You can use split + join, then groupby + unstack

# Trim event_name to its first two "_"-separated parts (hv_users_r -> hv_users), then sum and pivot
df.assign(event_name=df['event_name'].apply(lambda x: "_".join(x.split("_", 2)[:2]))).\
    groupby(['campaign_name','campaign_id','clicks','installs','event_name'])['conversions'].sum().\
      unstack(fill_value=0).reset_index()
Out[302]: 
event_name campaign_name  campaign_id  clicks  installs  hv_users  registration
0             campaign_1         1234     100         5         2             1
1             campaign_2         2345     500        10         2             3
2             campaign_3         3456    1000        50         0            10
3             campaign_4         3456    1000        50        40             0
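In the output above, campaign_3 and campaign_4 stay on separate rows because campaign_name is one of the grouping keys, whereas the expected output in the question folds every campaign_id 3456 row into campaign_3. One possible way to match it is sketched below. It assumes that the campaign_4 label in the sample data really belongs with campaign_3 (the question does not say so), so it groups on campaign_id only and keeps the first campaign_name, clicks and installs seen for each id.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'campaign_name': ['campaign_1', 'campaign_1', 'campaign_2', 'campaign_2',
                      'campaign_3', 'campaign_4', 'campaign_4'],
    'campaign_id': [1234, 1234, 2345, 2345, 3456, 3456, 3456],
    'event_name': ['registration', 'hv_users_r', 'registration', 'hv_users_w',
                   'registration', 'hv_users_r', 'hv_users_w'],
    'clicks': [100, 100, 500, 500, 1000, 1000, 1000],
    'installs': [5, 5, 10, 10, 50, 50, 50],
    'conversions': [1, 2, 3, 2, 10, 15, 25],
})

# Collapse hv_users_r / hv_users_w into hv_users; 'registration' is left unchanged
grouped = df.assign(event_name=df['event_name'].str.replace(r'_[rw]$', '', regex=True))

# Sum conversions per campaign_id and event, one column per event
conv = (grouped.groupby(['campaign_id', 'event_name'])['conversions']
               .sum()
               .unstack(fill_value=0))

# Assumption: the first row of each campaign_id is representative for these attributes
attrs = grouped.groupby('campaign_id')[['campaign_name', 'clicks', 'installs']].first()

result = attrs.join(conv).reset_index()
print(result)
```

If campaign_4 is in fact a distinct campaign, keeping campaign_name in the grouping keys, as in the answer, is the safer choice.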

Answer 1: (score: 1)

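# Copy conversions into the matching new column, 0 elsewhere
df['hv_users'] = df.conversions.where(df.event_name.str.match(r'hv_users_[rw]'), 0)
df['registrations'] = df.conversions.where(df.event_name == 'registration', 0)
# Broadcast the per-campaign_id hv_users total onto every row of the group
df.hv_users = df.groupby('campaign_id').hv_users.transform('sum')
# Keep the first row per campaign_id (this relies on the registration row coming first)
df = df.groupby('campaign_id').head(1).drop('event_name', axis=1)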
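The same masking idea can also be written as a single aggregation. The sketch below is an alternative to the transform + head(1) steps, not the answerer's code. It assumes the sample data from the question and uses named aggregation, which needs pandas 0.25 or later.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'campaign_name': ['campaign_1', 'campaign_1', 'campaign_2', 'campaign_2',
                      'campaign_3', 'campaign_4', 'campaign_4'],
    'campaign_id': [1234, 1234, 2345, 2345, 3456, 3456, 3456],
    'event_name': ['registration', 'hv_users_r', 'registration', 'hv_users_w',
                   'registration', 'hv_users_r', 'hv_users_w'],
    'clicks': [100, 100, 500, 500, 1000, 1000, 1000],
    'installs': [5, 5, 10, 10, 50, 50, 50],
    'conversions': [1, 2, 3, 2, 10, 15, 25],
})

# Route each row's conversions into exactly one of the two new columns
is_reg = df['event_name'].eq('registration')
tmp = df.assign(
    registrations=df['conversions'].where(is_reg, 0),
    hv_users=df['conversions'].where(~is_reg, 0),
)

# One row per campaign_id: keep the first name/clicks/installs, sum the new columns
out = tmp.groupby('campaign_id', as_index=False).agg(
    campaign_name=('campaign_name', 'first'),
    clicks=('clicks', 'first'),
    installs=('installs', 'first'),
    registrations=('registrations', 'sum'),
    hv_users=('hv_users', 'sum'),
)
print(out)
```

Because both new columns are summed, this version does not depend on the registration row being the first row of its group.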

Answer 2: (score: 0)

pd.crosstab() and pd.pivot_table() should do the trick.

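import pandas as pd

# df is your input dataframe
# Collapse both hv_users variants into one label
replacement = {'hv_users_w': 'hv_users', 'hv_users_r': 'hv_users', 'registration': 'registration'}
df.event_name = df.event_name.map(replacement)
# Sum conversions per campaign and event
df1 = pd.crosstab(df.campaign_name, df.event_name, values=df.conversions, aggfunc='sum').fillna(0)
# Per-campaign attributes are constant within a campaign, so take the first value
df2 = pd.pivot_table(df, index='campaign_name', values=['campaign_id', 'clicks', 'installs'], aggfunc='first')
output = pd.concat([df1, df2], axis=1)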
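One detail worth checking with pd.crosstab: called with only the row and column arguments it counts occurrences, whereas summing conversions needs the `values` and `aggfunc` arguments. A small sketch on the sample data from the question, for illustration only:

```python
import pandas as pd

# Only the columns needed for the comparison, copied from the question
df = pd.DataFrame({
    'campaign_name': ['campaign_1', 'campaign_1', 'campaign_2', 'campaign_2',
                      'campaign_3', 'campaign_4', 'campaign_4'],
    'event_name': ['registration', 'hv_users_r', 'registration', 'hv_users_w',
                   'registration', 'hv_users_r', 'hv_users_w'],
    'conversions': [1, 2, 3, 2, 10, 15, 25],
})
df['event_name'] = df['event_name'].replace({'hv_users_r': 'hv_users', 'hv_users_w': 'hv_users'})

# Default crosstab: number of rows per (campaign, event) pair
counts = pd.crosstab(df['campaign_name'], df['event_name'])

# With values + aggfunc: sum of conversions per pair; missing pairs are filled with 0
sums = pd.crosstab(df['campaign_name'], df['event_name'],
                   values=df['conversions'], aggfunc='sum').fillna(0).astype(int)

print(counts)
print(sums)
```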

Answer 3: (score: 0)

Try using pivot_table

mask = df['event_name'].str.contains('_')
df.loc[mask, 'event_name'] = df.loc[mask, 'event_name'].str.extract(r'(.*_.*)_.*', expand=False)
new_df = df.pivot_table(index=['campaign_name', 'campaign_id', 'clicks', 'installs'], columns='event_name',
                        values='conversions', aggfunc='sum', fill_value=0).reset_index().rename_axis(None, axis=1)

    campaign_name   campaign_id clicks  installs    hv_users    registration
0   campaign_1      1234        100     5           2           1
1   campaign_2      2345        500     10          2           3
2   campaign_3      3456        1000    50          0           10
3   campaign_4      3456        1000    50          40          0