Multi-category variable

Time: 2019-07-03 10:48:55

Tags: python pandas pandas-groupby categorical-data

I have this data:

ID  Page Time_on_page
1    A       60
1    B       80
2    C       120
2    C       30
3    A       10
3    B       50
3    C       60
3    B       30

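I need to group it by ID and, for each level of Page, get the sum of Time_on_page together with the corresponding dummy variable (this is a simplified version; my real data has more than 3 unique pages):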

ID  Page_A  Page_B  Page_C  Time_on_page_A  Time_on_page_B  Time_on_page_C
1     1       1        0         60               80              0
2     0       0        1         0                 0              150
3     1       1        1         10                80              60

I tried

pd.get_dummies(df, columns=cols, drop_first=False).groupby(['ID','Page'], as_index=False).sum()

but it does not work.

Thanks for your help!

5 answers:

Answer 0 (score: 1)

Here is one way using pd.pivot_table:

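# Sum Time_on_page per (ID, Page); pairs that never occur stay NaN for now
out = (pd.pivot_table(data=df, index='ID', columns='Page',
                      values='Time_on_page', aggfunc='sum')
        .add_prefix('Time_on_page_'))
# A non-NaN cell means the ID visited that page, which gives the 0/1 dummies
df2 = out.notna().astype('i1')
# Keep the last 6 characters of each name: 'Time_on_page_A' -> 'page_A'
df2.columns = df2.columns.str[-6:]
out.assign(**df2).fillna(0).astype(int)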

Page  Time_on_page_A  Time_on_page_B  Time_on_page_C  page_A  page_B  page_C
ID                                                                          
1                 60              80               0       1       1       0
2                  0               0             150       0       0       1
3                 10              80              60       1       1       1

Answer 1 (score: 1)

You could use crosstab, as follows:

pd.crosstab(df.ID,df.Page,df.Page,aggfunc='nunique').fillna(0).add_prefix('Page_').join(
pd.crosstab(df.ID,df.Page,df.Time_on_page,aggfunc='sum')
    .add_prefix('Time_on_Page_').fillna(0))

Page  Page_A  Page_B  Page_C  Time_on_Page_A  Time_on_Page_B  Time_on_Page_C
ID                                                                          
1        1.0     1.0     0.0            60.0            80.0             0.0
2        0.0     0.0     1.0             0.0             0.0           150.0
3        1.0     1.0     1.0            10.0            80.0            60.0
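
The result above comes out as floats because the missing ID/Page pairs are NaN before fillna(0) is applied. As a minimal sketch of the same crosstab idea with integer output, assuming the sample data from the question (the names flags, totals and result_int are only illustrative):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 3, 3, 3, 3],
    'Page': ['A', 'B', 'C', 'C', 'A', 'B', 'C', 'B'],
    'Time_on_page': [60, 80, 120, 30, 10, 50, 60, 30],
})

# Presence flags: count rows per (ID, Page), then cap the counts at 1
flags = pd.crosstab(df['ID'], df['Page']).clip(upper=1).add_prefix('Page_')

# Time totals: sum Time_on_page per (ID, Page); missing pairs become 0
totals = (pd.crosstab(df['ID'], df['Page'],
                      values=df['Time_on_page'], aggfunc='sum')
            .fillna(0)
            .astype(int)
            .add_prefix('Time_on_page_'))

# Both frames are indexed by ID, so a plain join lines them up
result_int = flags.join(totals)
print(result_int)
```

Counting rows and clipping at 1 keeps the flags as integers from the start, so only the time totals need the fillna and astype step.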

Answer 2 (score: 0)

import pandas as pd

df = pd.DataFrame({
        'ID': [1,1,2,2,3,3,3,3],
        'Page': [ 'A', 'B','C','C', 'A', 'B','C','B'],
        'Time_on_page' : [60,80,120,30,10,50,60,30]
    })

# Create Dummies
adf = pd.get_dummies(df, columns=['Page'], drop_first=False).groupby(['ID']).max().reset_index()

# Calculate ID, Page wise Time sums
bdf = df.groupby(['ID','Page'])['Time_on_page'].sum().unstack(['Page']).fillna(0).reset_index()

# Merge both
result = adf.merge(bdf, on=['ID']).drop('Time_on_page', axis=1)

print (result)
    ID      Page_A  Page_B  Page_C  A     B      C
    1        1      1       0      60.0   80.0   0.0
    2        0      0       1      0.0    0.0   150.0
    3        1      1       1      10.0   80.0  60.0
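
In the result above the time columns are named A, B and C rather than Time_on_page_A and so on. A possible variant of the same get_dummies plus groupby idea that produces the column names asked for in the question might look like this (flags, totals and result_named are illustrative names, not part of the original answer):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 3, 3, 3, 3],
    'Page': ['A', 'B', 'C', 'C', 'A', 'B', 'C', 'B'],
    'Time_on_page': [60, 80, 120, 30, 10, 50, 60, 30],
})

# Dummies for Page only; max per ID marks a page as visited at least once
flags = (pd.get_dummies(df[['ID', 'Page']], columns=['Page'])
           .groupby('ID')
           .max()
           .astype(int))

# Total time per (ID, Page), reshaped to one column per page and prefixed
totals = (df.groupby(['ID', 'Page'])['Time_on_page']
            .sum()
            .unstack('Page', fill_value=0)
            .add_prefix('Time_on_page_'))

# Join on the shared ID index, then turn ID back into a regular column
result_named = flags.join(totals).reset_index()
print(result_named)
```

Selecting only ID and Page before get_dummies avoids carrying Time_on_page through the max step, so nothing has to be dropped after the join.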

Answer 3 (score: 0)

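# Collapse duplicate (ID, Page) rows first so that len counts each pair once
df1 = df.groupby(['ID', 'Page']).sum().reset_index()
# len gives the 0/1 flag, sum gives the total time; missing pairs are filled with 0
pd.pivot_table(df1, values='Time_on_page', index='ID', columns='Page',
               aggfunc=[len, sum], fill_value=0)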

Result:

     len       sum         
Page   A  B  C   A   B    C
ID                         
1      1  1  0  60  80    0
2      0  0  1   0   0  150
3      1  1  1  10  80   60
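
The result above has two-level columns (len/sum over Page). If flat names such as Page_A and Time_on_page_A are wanted, one way to get there is sketched below. It assumes the sample data from the question and uses the string aggregations 'count' and 'sum' in place of the builtins len and sum; wide is an illustrative name.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 3, 3, 3, 3],
    'Page': ['A', 'B', 'C', 'C', 'A', 'B', 'C', 'B'],
    'Time_on_page': [60, 80, 120, 30, 10, 50, 60, 30],
})

# One row per (ID, Page) so that counting yields 0/1 flags
df1 = df.groupby(['ID', 'Page'], as_index=False)['Time_on_page'].sum()

wide = pd.pivot_table(df1, values='Time_on_page', index='ID', columns='Page',
                      aggfunc=['count', 'sum'], fill_value=0)

# Rename the top column level to the prefixes used in the question
wide = wide.rename(columns={'count': 'Page', 'sum': 'Time_on_page'}, level=0)

# Flatten ('Page', 'A') -> 'Page_A' and make sure everything is integer
wide.columns = ['_'.join(col) for col in wide.columns]
wide = wide.astype(int)
print(wide)
```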

Answer 4 (score: 0)

Group by ID and Page, aggregate each column with agg, then unstack. Finally, flatten the MultiIndex columns using map and join:

df1 = df.groupby(['ID', 'Page']).agg({'Page': lambda x: 1, 'Time_on_page': 'sum'}) \
                                .unstack(fill_value=0)
df1.columns = df1.columns.map('_'.join)


Out[467]:
    Page_A  Page_B  Page_C  Time_on_page_A  Time_on_page_B  Time_on_page_C
ID
1        1       1       0              60              80               0
2        0       0       1               0               0             150
3        1       1       1              10              80              60
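
The approach above aggregates the grouping column Page itself with a lambda. As a hedged alternative, the sketch below gets the same layout with named aggregation (available since pandas 0.25) over a helper column of ones. The column name flag and the variable wide are assumptions made for illustration only.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 3, 3, 3, 3],
    'Page': ['A', 'B', 'C', 'C', 'A', 'B', 'C', 'B'],
    'Time_on_page': [60, 80, 120, 30, 10, 50, 60, 30],
})

wide = (df.assign(flag=1)                       # helper column: 1 for every visit
          .groupby(['ID', 'Page'])
          .agg(Page=('flag', 'max'),            # 1 if the pair occurs at all
               Time_on_page=('Time_on_page', 'sum'))
          .unstack(fill_value=0))               # absent pairs become 0

# Flatten ('Page', 'A') -> 'Page_A', ('Time_on_page', 'A') -> 'Time_on_page_A'
wide.columns = wide.columns.map('_'.join)
print(wide)
```

Because unstack fills absent pairs with 0 directly, all columns stay integer and no fillna step is needed.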