Constraining a groupby operation

Asked: 2018-02-05 16:15:51

Tags: python pandas

I have three dataframes, shown below:

  1. primarydf is a dataframe showing, for each book, TitleCode, Type (digital/physical), WeekEnding (meaning the units data covers the week ending on that date), and TotalUnits (units sold).
  2. attachdf is a dataframe showing products related to the book; an attach product is released after the book goes on sale.
  3. expecteddf is a detailed version of what I want to achieve, with some extra columns included for clarity.
  4. The columns I actually want are TitleCode, Type, Week 1 Attach Rate, Week 4 Attach Rate, LTD Attach Rate, and Total Attached Units.

    • Week1A = attach product units sold in week 1 (the attach product's release week).

    • Week1U = books sold up to and including week 1 of the attach product's release.

    • Week4A = attach product units sold through week 4 (inclusive).

    • Week4U = books sold up to and including week 4 of the attach product's release.

    • LTDA = attach product units sold through the most recent date.

    • LTDU = books sold through the most recent date.

    For example, ['Week 1 Attach Rate'] = ['Week1A']/['Week1U'], ['Week 4 Attach Rate'] = ['Week4A']/['Week4U'], and ['LTD Attach Rate'] = ['LTDA']/['LTDU'].
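As a quick sanity check of these definitions, here is the Week 1 rate for TitleCode A, Type 1, computed by hand from the dataframes that follow (the variable names are mine, for illustration only):

```python
# Week1A: attach units in the attach product's first week (12/16/2017) for A, Type 1
week1a = 548
# Week1U: primary (book) units sold up to and including that same week
week1u = 830 + 830 + 830 + 132 + 161 + 6  # primarydf rows for A, Type 1 through 12/16/2017
rate = week1a / week1u
print(f"{rate:.1%}")  # matches the 19.6% in expecteddf
```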

    Dataframes

    import pandas as pd
    
    from io import StringIO
    attachproduct = StringIO("""TitleCode,Type,WeekEnding,TotalUnits
    A,1,12/16/2017 0:00,548
    A,1,12/23/2017 0:00,74
    A,1,12/30/2017 0:00,87
    A,1,1/6/2018 0:00,4
    A,1,1/13/2018 0:00,878
    A,2,12/16/2017 0:00,49
    A,2,12/23/2017 0:00,8498
    A,2,12/30/2017 0:00,84
    A,2,1/6/2018 0:00,74
    A,2,1/13/2018 0:00,453
    B,1,12/23/2017 0:00,68
    B,1,12/30/2017 0:00,573
    B,1,1/6/2018 0:00,75
    B,1,1/13/2018 0:00,752
    B,1,1/20/2018 0:00,75
    B,2,12/23/2017 0:00,17
    B,2,12/30/2017 0:00,98
    B,2,1/6/2018 0:00,7875
    B,2,1/13/2018 0:00,73
    B,2,1/20/2018 0:00,75
    C,1,12/23/2017 0:00,79
    C,1,12/30/2017 0:00,75727
    C,1,1/6/2018 0:00,77
    C,1,1/13/2018 0:00,3727
    C,1,1/20/2018 0:00,72
    C,1,1/27/2018 0:00,7275
    C,1,2/3/2018 0:00,27
    """)
    
    primaryproduct = StringIO("""TitleCode,Type,WeekEnding,TotalUnits
    A,1,11/11/2017 0:00,830
    A,1,11/18/2017 0:00,830
    A,1,11/25/2017 0:00,830
    A,1,12/2/2017 0:00,132
    A,1,12/9/2017 0:00,161
    A,1,12/16/2017 0:00,6
    A,1,12/23/2017 0:00,1701
    A,1,12/30/2017 0:00,1240
    A,2,11/11/2017 0:00,141
    A,2,11/18/2017 0:00,141
    A,2,11/25/2017 0:00,141
    A,2,12/2/2017 0:00,22388
    A,2,12/9/2017 0:00,255
    A,2,12/16/2017 0:00,90
    A,2,12/23/2017 0:00,1471
    A,2,12/30/2017 0:00,1010
    A,2,1/6/2018 0:00,8
    A,2,1/13/2018 0:00,9
    B,1,12/2/2017 0:00,254
    B,1,12/9/2017 0:00,1022
    B,1,12/16/2017 0:00,241
    B,1,12/23/2017 0:00,1532
    B,1,12/30/2017 0:00,122
    B,1,1/6/2018 0:00,442
    B,1,1/13/2018 0:00,761
    B,1,1/20/2018 0:00,1081
    B,2,12/2/2017 0:00,49
    B,2,12/9/2017 0:00,351
    B,2,12/16/2017 0:00,19951
    B,2,12/23/2017 0:00,253
    B,2,12/30/2017 0:00,282
    B,2,1/6/2018 0:00,601
    B,2,1/13/2018 0:00,921
    B,2,1/20/2018 0:00,1241
    C,1,11/25/2017 0:00,273
    C,1,12/2/2017 0:00,151944
    C,1,12/9/2017 0:00,95
    C,1,12/16/2017 0:00,8736
    C,1,12/23/2017 0:00,172
    C,1,12/30/2017 0:00,15005
    C,1,1/6/2018 0:00,51
    C,1,1/13/2018 0:00,52
    C,1,1/20/2018 0:00,45
    C,1,1/27/2018 0:00,6
    C,1,2/3/2018 0:00,55
    """)
    
    expected = StringIO("""TitleCode,Type,Week1A,Week4A,LTDA,Week1M,Week4M,LTDM,Week 1 Attach Rate,4 Week Attach Rate,LTD Attach Rate,Total Attached Units
    A,1,548,713,1591,2789,5731,5731,19.6%,12.4%,27.8%,1591
    A,2,49,8705,9158,23155,25644,25653,0.2%,33.9%,35.7%,9158
    B,1,68,1468,1543,3049,301644,5455,2.2%,0.5%,28.3%,1543
    B,2,17,8063,8138,20604,22408,23648,0.1%,36.0%,34.4%,8138
    C,1,79,79610,86984,161220,176327,176433,0.0%,45.1%,49.3%,86984
    """)
    
    attachdf = pd.read_csv(attachproduct, parse_dates=True)
    primarydf = pd.read_csv(primaryproduct, parse_dates=True)
    expecteddf = pd.read_csv(expected, parse_dates=True)
    
    attachdf['WeekEnding']=pd.to_datetime(attachdf['WeekEnding'])
    primarydf['WeekEnding']=pd.to_datetime(primarydf['WeekEnding'])
    

    Getting the life-to-date values is straightforward, but I'm not sure of the best approach for the constrained week-1 and week-4 rates.

    ltdattach = (attachdf.groupby(['TitleCode','Type']).sum()) / (primarydf.groupby(['TitleCode','Type']).sum())
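That life-to-date ratio can be verified on a toy pair of frames; `attach` and `primary` below are hypothetical stand-ins with made-up numbers:

```python
import pandas as pd

# Two groups: (A, 1) with 30 attach units vs 200 primary; (B, 1) with 5 vs 50
attach = pd.DataFrame({'TitleCode': ['A', 'A', 'B'], 'Type': [1, 1, 1],
                       'TotalUnits': [10, 20, 5]})
primary = pd.DataFrame({'TitleCode': ['A', 'A', 'B'], 'Type': [1, 1, 1],
                        'TotalUnits': [100, 100, 50]})

# Group sums align on the (TitleCode, Type) index, so division is per group
ltd = (attach.groupby(['TitleCode', 'Type'])['TotalUnits'].sum()
       / primary.groupby(['TitleCode', 'Type'])['TotalUnits'].sum())
print(ltd)  # (A, 1) -> 0.15, (B, 1) -> 0.10
```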
    

1 Answer:

Answer 0 (score: 1)

Consider calculating columns with groupby + transform, then joining or merging the helper dataframes week1df and week4df.

# ADD NEW COLUMNS TO ATTACH DF
attachdf['WeekNo'] = attachdf.groupby(['TitleCode', 'Type']).cumcount()+1
attachdf['Week4A'] = attachdf[attachdf['WeekNo']<=4].groupby(['TitleCode', 'Type'])['TotalUnits'].transform('sum')
attachdf['LTDA'] = attachdf.groupby(['TitleCode', 'Type'])['TotalUnits'].transform('sum')
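A detail worth noting in the Week4A line above: `transform('sum')` on a filtered subset aligns the group sum back onto only the filtered rows, leaving the rest NaN. A minimal sketch with a hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'Key': ['A'] * 5, 'Units': [1, 2, 3, 4, 5]})
df['WeekNo'] = df.groupby('Key').cumcount() + 1

# Sum of the first 4 weeks (1+2+3+4 = 10), broadcast only to those rows;
# week 5 gets NaN because it was excluded before the transform
df['First4'] = df[df['WeekNo'] <= 4].groupby('Key')['Units'].transform('sum')
print(df['First4'].tolist())  # [10.0, 10.0, 10.0, 10.0, nan]
```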


# ADD NEW COLUMNS TO PRIMARY DF
primarydf['LTDM'] = primarydf.groupby(['TitleCode', 'Type'])['TotalUnits'].transform('sum')
primarydf['WeekNo'] = primarydf.groupby(['TitleCode', 'Type']).cumcount()+1
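The `cumcount() + 1` trick numbers the weeks within each group; note it assumes the rows are already sorted by WeekEnding within each group, which holds for the data above. A tiny hypothetical example:

```python
import pandas as pd

df = pd.DataFrame({'TitleCode': ['A', 'A', 'A', 'B', 'B'],
                   'Type': [1, 1, 1, 1, 1],
                   'TotalUnits': [5, 6, 7, 8, 9]})

# cumcount() is 0-based within each group, so +1 gives a 1-based week number
df['WeekNo'] = df.groupby(['TitleCode', 'Type']).cumcount() + 1
print(df['WeekNo'].tolist())  # [1, 2, 3, 1, 2]
```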


# WEEK 1 DF (LEFT JOIN MERGE)
week1df = primarydf.merge(attachdf[attachdf['WeekNo']==1], on=['TitleCode', 'Type'], 
                          suffixes=['', '_'], how='left').query('WeekEnding <= WeekEnding_')
week1df['Week1M'] = week1df.groupby(['TitleCode', 'Type'])['TotalUnits'].transform('sum')
week1df = week1df[week1df['WeekNo']==1][['TitleCode', 'Type', 'TotalUnits_', 'Week4A', 'LTDA', 'Week1M']]\
                  .rename(columns={'TotalUnits_':'Week1A'})


# WEEK 4 DF (LEFT JOIN MERGE)
week4df = primarydf.merge(attachdf[attachdf['WeekNo']==4], on=['TitleCode', 'Type'], 
                          suffixes=['', '_'], how='left').query('WeekEnding <= WeekEnding_')
week4df['Week4M'] = week4df.groupby(['TitleCode', 'Type'])['TotalUnits'].transform('sum')
week4df = week4df[week4df['WeekNo']==1][['TitleCode', 'Type', 'Week4M', 'LTDM']]
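The merge-then-query constraint used above can be seen in isolation on a toy frame (`Key` and the dates here are hypothetical): merging in each group's cutoff week and filtering with `WeekEnding <= WeekEnding_` keeps only the primary rows at or before that cutoff, which is what makes the cumulative sum "constrained".

```python
import pandas as pd

primary = pd.DataFrame({'Key': ['A', 'A', 'A'],
                        'WeekEnding': pd.to_datetime(['2018-01-06', '2018-01-13', '2018-01-20']),
                        'TotalUnits': [10, 20, 30]})
cutoff = pd.DataFrame({'Key': ['A'],
                       'WeekEnding': pd.to_datetime(['2018-01-13'])})

# Broadcast the cutoff date onto every primary row, then keep rows at or before it
m = (primary.merge(cutoff, on='Key', suffixes=['', '_'], how='left')
            .query('WeekEnding <= WeekEnding_'))
print(m.groupby('Key')['TotalUnits'].sum())  # A -> 30 (10 + 20; the 1/20 row is dropped)
```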


# FINAL (MERGE WEEKS WITH PCT COLUMNS)
finaldf = week1df.merge(week4df, on=['TitleCode', 'Type'])

finaldf['Week 1 Attach Rate'] = finaldf['Week1A'] / finaldf['Week1M']
finaldf['Week 4 Attach Rate'] = finaldf['Week4A'] / finaldf['Week4M']
finaldf['LTD Attach Rates'] = finaldf['LTDA'] / finaldf['LTDM']
finaldf['Total Attached Units'] = finaldf['LTDA']

Output

print(finaldf)

#    TitleCode  Type  Week1A   Week4A   LTDA  Week1M  Week4M    LTDM  Week 1 Attach Rate  Week 4 Attach Rate  LTD Attach Rates  Total Attached Units
# 0          A     1     548    713.0   1591    2789    5730    5730            0.196486            0.124433          0.277661                  1591
# 1          A     2      49   8705.0   9158   23156   25645   25654            0.002116            0.339442          0.356981                  9158
# 2          B     1      68   1468.0   1543    3049    4374    5455            0.022302            0.335620          0.282860                  1543
# 3          B     2      17   8063.0   8138   20604   22408   23649            0.000825            0.359827          0.344116                  8138
# 4          C     1      79  79610.0  86984  161220  176328  176434            0.000490            0.451488          0.493012                 86984
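One gap between this output and expecteddf: the question shows the rates as percentage strings like 19.6%, while the result keeps them as floats. If the string form is wanted, one possible post-processing step (sketched on a hypothetical column; applying `map` this way converts the column to strings, so do it last):

```python
import pandas as pd

demo = pd.DataFrame({'Week 1 Attach Rate': [0.196486, 0.002116]})
demo['Week 1 Attach Rate'] = demo['Week 1 Attach Rate'].map('{:.1%}'.format)
print(demo['Week 1 Attach Rate'].tolist())  # ['19.6%', '0.2%']
```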