Cumulative count based on a year column in pandas

Time: 2020-01-25 09:25:00

Tags: python pandas pandas-groupby

I have a data frame as shown below

Unit_ID             Unit_Create_Year
1                   2011
2                   2011
3                   2012
4                   2014
5                   2012
6                   2015
7                   2017
8                   2017
9                   2019

From the data frame above, I want to prepare the data below

Expected output:

Year         Number_of_Unit_Since_Year       List_of_Units
2011         2                               [1,2]
2012         4                               [1,2,3,5]
2013         4                               [1,2,3,5]
2014         5                               [1,2,3,5,4]
2015         6                               [1,2,3,5,4,6]
2016         6                               [1,2,3,5,4,6]
2017         8                               [1,2,3,5,4,6,7,8]
2018         8                               [1,2,3,5,4,6,7,8]
2019         9                               [1,2,3,5,4,6,7,8,9]

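If a unit was created in 2011, it should also be counted in the totals for every following year.

Steps: In 2011, two units, "1" and "2", were created. In 2012, two more units, "3" and "5", were created. So 2012 has 4 units in total, including the units from 2011.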

3 answers:

Answer 0 (score: 2)

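import pandas as pd

# Sample data; note that unit 9 is entered as 2017 here, not 2019 as in the question
df = pd.DataFrame({
    'unit_id' : [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'activity_gur' : [2011,2011,2012,2014,2012,2015,2017,2017,2017]})

def fill_number_of_unit_since_year(year):
    # Number of distinct units created in exactly this year
    return df[df['activity_gur'] == year]['unit_id'].nunique()

def fill_list_of_units(year):
    # All distinct units created in this year or earlier
    return df[df['activity_gur'] <= year]['unit_id'].unique()

# One row per year that has at least one unit, sorted so that cumsum runs chronologically
final_df = pd.DataFrame({'year' : sorted(df['activity_gur'].unique())})
final_df['number_of_unit_since_year'] = final_df['year'].apply(fill_number_of_unit_since_year)
# Turn the per-year counts into running totals
final_df['number_of_unit_since_year'] = final_df['number_of_unit_since_year'].cumsum()
final_df['list_of_units'] = final_df['year'].apply(fill_list_of_units)
final_df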


Answer 1 (score: 2)

You can try the following:

df_new = df.groupby(['Unit_Create_Year']).agg({'Unit_ID':['count','unique']}).reset_index()
df_new.columns = ['Year','Number_of_Unit_Since_Year','List_of_Units']
df_new['Number_of_Unit_Since_Year'] = df_new['Number_of_Unit_Since_Year'].cumsum()
df_new['List_of_Units'] = df_new['List_of_Units'].apply(lambda x : x.tolist()).cumsum()

df_new


   Year  Number_of_Unit_Since_Year                List_of_Units
0  2011                          2                       [1, 2]
1  2012                          4                 [1, 2, 3, 5]
2  2014                          5              [1, 2, 3, 5, 4]
3  2015                          6           [1, 2, 3, 5, 4, 6]
4  2017                          9  [1, 2, 3, 5, 4, 6, 7, 8, 9]
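
Note that the output above lists only the years in which units were created, so 2013, 2016 and 2018 are missing compared with the expected output. Below is a minimal sketch of one way to fill those years. It assumes the question's data (unit 9 created in 2019) and uses `itertools.accumulate` instead of `cumsum` on lists; the names `per_year`, `all_years`, `cumulative` and `result` are made up for illustration.

```python
from itertools import accumulate

import pandas as pd

# Data as given in the question (assumption: unit 9 belongs to 2019).
df = pd.DataFrame({
    "Unit_ID": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "Unit_Create_Year": [2011, 2011, 2012, 2014, 2012, 2015, 2017, 2017, 2019],
})

# Units created in each year, in order of appearance.
per_year = df.groupby("Unit_Create_Year")["Unit_ID"].agg(list)

# Every year between the first and the last creation year.
all_years = list(range(int(per_year.index.min()), int(per_year.index.max()) + 1))

# Years with no new units come back as NaN after reindex; swap them for empty lists.
per_year = per_year.reindex(all_years).apply(
    lambda units: units if isinstance(units, list) else []
)

# Running concatenation of the per-year lists.
cumulative = list(accumulate(per_year, lambda seen, new: seen + new))

result = pd.DataFrame({
    "Year": all_years,
    "Number_of_Unit_Since_Year": [len(units) for units in cumulative],
    "List_of_Units": cumulative,
})
print(result)
```

With the question's data this should give one row per year from 2011 to 2019, with gap years repeating the previous year's list.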

Answer 2 (score: 1)

This should do the trick:

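import numpy as np
import pandas as pd

# Empty frame with one row per year from 2011 to 2019
df2=pd.DataFrame(index=list(range(2011,2020)), columns=["Number_of_units_since_year"], data=[np.nan]*(2020-2011))

# Running count of units in year order; the max within each year is that year's cumulative total
df=df.sort_values("Unit_Create_Year").set_index("Unit_Create_Year").expanding().count().reset_index().groupby("Unit_Create_Year").max()

# Fill in the years that have units
df2.loc[df.index.values]=df

# Carry totals forward into the gap years
df2=df2.ffill().astype(int).reset_index().rename(columns={"index": "Year"})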

Output:

   Year  Number_of_units_since_year
0  2011                           2
1  2012                           4
2  2013                           4
3  2014                           5
4  2015                           6
5  2016                           6
6  2017                           9
7  2018                           9
8  2019                           9
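
If only the counts are needed, a shorter route may be `value_counts` followed by `reindex` and `cumsum`. This is a sketch of a different technique, not the approach used in the answer above. It assumes the question's data (unit 9 created in 2019), so its last three rows would read 8, 8, 9 rather than the 9, 9, 9 shown above. The names `years`, `counts`, `cumulative` and `result` are illustrative.

```python
import pandas as pd

# Creation years as given in the question (assumption: unit 9 belongs to 2019).
years = pd.Series(
    [2011, 2011, 2012, 2014, 2012, 2015, 2017, 2017, 2019],
    name="Unit_Create_Year",
)

# Number of units created in each year, in chronological order.
counts = years.value_counts().sort_index()

# Add the gap years with a count of zero, then take the running total.
cumulative = counts.reindex(range(2011, 2020), fill_value=0).cumsum()

result = (
    cumulative
    .rename_axis("Year")
    .reset_index(name="Number_of_units_since_year")
)
print(result)
```

Filling the gap years with zero before `cumsum` means no forward fill is needed afterwards.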