My DataFrame looks like this:
campaign_name campaign_id event_name clicks installs conversions
campaign_1 1234 registration 100 5 1
campaign_1 1234 hv_users_r 100 5 2
campaign_2 2345 registration 500 10 3
campaign_2 2345 hv_users_w 500 10 2
campaign_3 3456 registration 1000 50 10
campaign_4 3456 hv_users_r 1000 50 15
campaign_4 3456 hv_users_w 1000 50 25
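I want to split the "event_name" values into 2 new columns. The first new column represents "registrations", and the second represents "hv_users", which should be the sum of all rows whose event name is "hv_users_r" or "hv_users_w".
To keep this simple: the "registrations" column contains only the rows whose event_name is "registration". All non-"registration" event_names go into the new column "hv_users".
This is my expected new DataFrame: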
campaign_name campaign_id clicks installs registrations hv_users
campaign_1 1234 100 5 1 2
campaign_2 2345 500 10 3 2
campaign_3 3456 1000 50 10 40
Can someone show me how to get from the input DataFrame to the output DataFrame?
Answer 0 (score: 1)
You can use split + join to collapse the event names, then groupby + unstack:
# Trim "hv_users_r" / "hv_users_w" to "hv_users", then sum conversions per campaign and event
df.assign(event_name=df['event_name'].apply(lambda x:"_".join(x.split("_", 2)[:2]))).\
groupby(['campaign_name','campaign_id','clicks','installs','event_name'])['conversions'].sum().\
unstack(fill_value=0).reset_index()
Out[302]:
event_name campaign_name campaign_id clicks installs hv_users registration
0 campaign_1 1234 100 5 2 1
1 campaign_2 2345 500 10 2 3
2 campaign_3 3456 1000 50 0 10
3 campaign_4 3456 1000 50 40 0
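The result has four rows because the sample data uses two names, campaign_3 and campaign_4, for campaign_id 3456, while the expected output in the question shows a single row with hv_users = 40. Assuming the rows are meant to be merged on campaign_id (the question does not say so explicitly), a minimal sketch could group on campaign_id and keep the first name seen. The helper name `summarize_by_id` is hypothetical.

```python
import pandas as pd

# Sample input copied from the question
df = pd.DataFrame({
    "campaign_name": ["campaign_1", "campaign_1", "campaign_2", "campaign_2",
                      "campaign_3", "campaign_4", "campaign_4"],
    "campaign_id": [1234, 1234, 2345, 2345, 3456, 3456, 3456],
    "event_name": ["registration", "hv_users_r", "registration", "hv_users_w",
                   "registration", "hv_users_r", "hv_users_w"],
    "clicks": [100, 100, 500, 500, 1000, 1000, 1000],
    "installs": [5, 5, 10, 10, 50, 50, 50],
    "conversions": [1, 2, 3, 2, 10, 15, 25],
})


def summarize_by_id(frame):
    """Hypothetical helper: one row per campaign_id (assumes the id identifies a campaign)."""
    # Collapse hv_users_r / hv_users_w into hv_users, as in the answer above
    event = frame["event_name"].apply(lambda x: "_".join(x.split("_", 2)[:2]))
    # Sum conversions per campaign_id and event category
    sums = (frame.assign(event_name=event)
                 .groupby(["campaign_id", "event_name"])["conversions"].sum()
                 .unstack(fill_value=0))
    # Take the first name / clicks / installs seen for each campaign_id
    meta = frame.groupby("campaign_id")[["campaign_name", "clicks", "installs"]].first()
    out = meta.join(sums).reset_index()
    out.columns.name = None
    return out


result = summarize_by_id(df)
print(result)
```

Whether merging on campaign_id is right depends on which of the two columns really identifies a campaign in your data.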
Answer 1 (score: 1)
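# Copy conversions into the matching column; every other row gets 0
df['hv_users'] = df.conversions.where(df.event_name.str.match(r'hv_users_[rw]'), 0)
df['registrations'] = df.conversions.where(df.event_name == 'registration', 0)
# Broadcast the per-campaign_id totals to every row of the group
df['hv_users'] = df.groupby('campaign_id')['hv_users'].transform('sum')
df['registrations'] = df.groupby('campaign_id')['registrations'].transform('sum')
# Keep one row per campaign_id and drop the columns that are no longer needed
df = df.groupby('campaign_id').head(1).drop(['event_name', 'conversions'], axis=1)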
Answer 2 (score: 0)
pd.crosstab() and pd.pivot_table() should do the trick.
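# df is your input DataFrame
replacement = {'hv_users_w': 'hv_users', 'hv_users_r': 'hv_users', 'registration': 'registration'}
df.event_name = df.event_name.map(replacement)
# Sum conversions per campaign and event category (crosstab only counts rows by default)
df1 = pd.crosstab(df.campaign_name, df.event_name, values=df.conversions, aggfunc='sum').fillna(0)
# One row per campaign for the columns that do not vary by event
df2 = pd.pivot_table(df, index='campaign_name', values=['campaign_id', 'clicks', 'installs'], aggfunc='first')
output = pd.concat([df1, df2], axis=1)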
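One thing to watch with pd.crosstab: called with only rows and columns it counts occurrences, and it sums a column only when both `values` and `aggfunc` are passed. A small self-contained sketch on the question's data shows the difference; the variable names `category`, `counts` and `sums` are illustrative only.

```python
import pandas as pd

# Sample input copied from the question
df = pd.DataFrame({
    "campaign_name": ["campaign_1", "campaign_1", "campaign_2", "campaign_2",
                      "campaign_3", "campaign_4", "campaign_4"],
    "campaign_id": [1234, 1234, 2345, 2345, 3456, 3456, 3456],
    "event_name": ["registration", "hv_users_r", "registration", "hv_users_w",
                   "registration", "hv_users_r", "hv_users_w"],
    "clicks": [100, 100, 500, 500, 1000, 1000, 1000],
    "installs": [5, 5, 10, 10, 50, 50, 50],
    "conversions": [1, 2, 3, 2, 10, 15, 25],
})

# Map every event name to its category
category = df["event_name"].map({
    "hv_users_w": "hv_users",
    "hv_users_r": "hv_users",
    "registration": "registration",
})

# Default behaviour: number of rows per (campaign, category)
counts = pd.crosstab(df["campaign_name"], category)

# With values + aggfunc: sum of conversions per (campaign, category);
# combinations that never occur come back as NaN, so fill them with 0
sums = pd.crosstab(df["campaign_name"], category,
                   values=df["conversions"], aggfunc="sum").fillna(0).astype(int)

print(counts)
print(sums)
```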
Answer 3 (score: 0)
Try using pivot_table:
df.loc[df['event_name'].str.contains('_'), 'event_name'] = df.loc[df['event_name'].str.contains('_'), 'event_name'].str.extract('(.*_.*)_.*', expand = False)
new_df = df.pivot_table(index=['campaign_name', 'campaign_id','clicks', 'installs'], columns='event_name', values = 'conversions',aggfunc='sum',fill_value=0).reset_index().rename_axis(None, axis=1)
campaign_name campaign_id clicks installs hv_users registration
0 campaign_1 1234 100 5 2 1
1 campaign_2 2345 500 10 2 3
2 campaign_3 3456 1000 50 0 10
3 campaign_4 3456 1000 50 40 0
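The question also states a simpler rule: every event that is not "registration" counts as hv_users. Under that rule no string parsing is needed. A sketch using numpy.where, with the output column names taken from the question and the variable names `category` and `new_df` chosen for illustration:

```python
import numpy as np
import pandas as pd

# Sample input copied from the question
df = pd.DataFrame({
    "campaign_name": ["campaign_1", "campaign_1", "campaign_2", "campaign_2",
                      "campaign_3", "campaign_4", "campaign_4"],
    "campaign_id": [1234, 1234, 2345, 2345, 3456, 3456, 3456],
    "event_name": ["registration", "hv_users_r", "registration", "hv_users_w",
                   "registration", "hv_users_r", "hv_users_w"],
    "clicks": [100, 100, 500, 500, 1000, 1000, 1000],
    "installs": [5, 5, 10, 10, 50, 50, 50],
    "conversions": [1, 2, 3, 2, 10, 15, 25],
})

# Anything that is not a registration is treated as hv_users (rule stated in the question)
category = np.where(df["event_name"] == "registration", "registrations", "hv_users")

new_df = (df.assign(event_name=category)
            .pivot_table(index=["campaign_name", "campaign_id", "clicks", "installs"],
                         columns="event_name", values="conversions",
                         aggfunc="sum", fill_value=0)
            .reset_index()
            .rename_axis(None, axis=1))

# Put the columns in the order requested in the question
new_df = new_df[["campaign_name", "campaign_id", "clicks", "installs",
                 "registrations", "hv_users"]]
print(new_df)
```

This still yields separate rows for campaign_3 and campaign_4, because campaign_name is part of the index.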