Question

I have a data frame as shown below.

Unit_ID             Unit_Create_Year
1                   2011
2                   2011
3                   2012
4                   2014
5                   2012
6                   2015
7                   2017
8                   2017
9                   2019

From the data frame above, I want to prepare the data below.

Expected output:

Year         Number_of_Unit_Since_Year       List_of_Units
2011         2                               [1,2]
2012         4                               [1,2,3,5]
2013         4                               [1,2,3,5]
2014         5                               [1,2,3,5,4]
2015         6                               [1,2,3,5,4,6]
2016         6                               [1,2,3,5,4,6]
2017         8                               [1,2,3,5,4,6,7,8]
2018         8                               [1,2,3,5,4,6,7,8]
2019         9                               [1,2,3,5,4,6,7,8,9]

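If a unit was created in 2011, it should also be counted in every subsequent year.

Steps: In 2011, two units, "1" and "2", were created. In 2012, two more units, "3" and "5", were created. So 2012 has 4 units in total, including the ones from 2011.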
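To make the counting rule concrete, here is a minimal sketch in plain Python (no pandas). The dictionary `unit_create_year` and the helper `units_since` are names made up for this illustration; the data is copied from the table above.

```python
# Sample data from the table above: unit ID -> creation year.
unit_create_year = {1: 2011, 2: 2011, 3: 2012, 4: 2014, 5: 2012,
                    6: 2015, 7: 2017, 8: 2017, 9: 2019}

def units_since(year):
    """Return the IDs of all units created in or before `year`, oldest first."""
    # sorted() is stable, so units created in the same year keep their ID order.
    return sorted(
        (uid for uid, created in unit_create_year.items() if created <= year),
        key=lambda uid: unit_create_year[uid],
    )

# One row per year, including years in which no unit was created.
for year in range(2011, 2020):
    units = units_since(year)
    print(year, len(units), units)
```

A year with no new units (2013, for example) simply repeats the previous year's count and list.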

Answer 1

import pandas as pd

df = pd.DataFrame({
    'unit_id' : [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'activity_gur' : [2011,2011,2012,2014,2012,2015,2017,2017,2017]})

# Number of distinct units created in exactly this year (cumulated below).
def fill_number_of_unit_since_year(year):
    return df[df['activity_gur'] == year]['unit_id'].nunique()

# IDs of all units created in or before this year.
def fill_list_of_units(year):
    return df[df['activity_gur'] <= year]['unit_id'].unique()

# Sort the years so that the running total below accumulates chronologically.
final_df = pd.DataFrame({'year' : sorted(df['activity_gur'].unique())})
final_df['number_of_unit_since_year'] = final_df['year'].apply(fill_number_of_unit_since_year)
final_df['number_of_unit_since_year'] = final_df['number_of_unit_since_year'].cumsum()
final_df['list_of_units'] = final_df['year'].apply(fill_list_of_units)
final_df
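The code in this answer only produces rows for years that actually occur in the data, so gap years such as 2013 and 2016 are missing. One possible variation, as a sketch and not part of the original answer, is to build the year column from the full range and compute each list directly. It reuses this answer's sample data and column names; `all_years` is a name introduced here.

```python
import pandas as pd

df = pd.DataFrame({
    'unit_id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'activity_gur': [2011, 2011, 2012, 2014, 2012, 2015, 2017, 2017, 2017]})

# Every year from the first to the last creation year, including the gaps.
all_years = range(df['activity_gur'].min(), df['activity_gur'].max() + 1)

final_df = pd.DataFrame({'year': list(all_years)})

# Units created in or before each year; a gap year repeats the previous list.
final_df['list_of_units'] = final_df['year'].apply(
    lambda year: df.loc[df['activity_gur'] <= year, 'unit_id'].tolist())

# The count is just the length of each list.
final_df['number_of_unit_since_year'] = final_df['list_of_units'].apply(len)
```

Note that the lists here follow the row order of `df` (so 2014 gives `[1, 2, 3, 4, 5]`), not the creation order shown in the question's expected output.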

Answer 2

You can try the following:

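# Per creation year: how many units were created, and which ones.
df_new = df.groupby(['Unit_Create_Year']).agg({'Unit_ID':['count','unique']}).reset_index()
df_new.columns = ['Year','Number_of_Unit_Since_Year','List_of_Units']
# Running total of the per-year counts.
df_new['Number_of_Unit_Since_Year'] = df_new['Number_of_Unit_Since_Year'].cumsum()
# Convert each array to a list so that cumsum concatenates the lists.
df_new['List_of_Units'] = df_new['List_of_Units'].apply(lambda x : x.tolist()).cumsum()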

df_new


   Year  Number_of_Unit_Since_Year                List_of_Units
0  2011                          2                       [1, 2]
1  2012                          4                 [1, 2, 3, 5]
2  2014                          5              [1, 2, 3, 5, 4]
3  2015                          6           [1, 2, 3, 5, 4, 6]
4  2017                          9  [1, 2, 3, 5, 4, 6, 7, 8, 9]
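The output above has no rows for the gap years (2013, 2016), and whether `cumsum` concatenates a column of Python lists may depend on the pandas version, which is worth checking on your own installation. As a hedged alternative sketch, the running concatenation can be done with `itertools.accumulate` and the gap years filled with `reindex` plus `ffill`. The data below is copied from the question's table; `per_year`, `cumulative` and `full_range` are names introduced for this sketch.

```python
from itertools import accumulate

import pandas as pd

df = pd.DataFrame({
    'Unit_ID': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'Unit_Create_Year': [2011, 2011, 2012, 2014, 2012, 2015, 2017, 2017, 2019]})

# One list of unit IDs per creation year (groupby sorts the years).
per_year = df.groupby('Unit_Create_Year')['Unit_ID'].agg(list)

# Running concatenation of those lists, done in plain Python.
cumulative = pd.Series(list(accumulate(per_year)), index=per_year.index)

# Add the gap years and carry the previous year's list forward.
full_range = range(per_year.index.min(), per_year.index.max() + 1)
cumulative = cumulative.reindex(full_range).ffill()

df_new = cumulative.rename('List_of_Units').rename_axis('Year').reset_index()
df_new['Number_of_Unit_Since_Year'] = df_new['List_of_Units'].apply(len)
df_new = df_new[['Year', 'Number_of_Unit_Since_Year', 'List_of_Units']]
```

With the question's data this gives one row per year from 2011 through 2019, and the lists keep the creation order used in the expected output.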

Answer 3

This should do the trick:

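import numpy as np
import pandas as pd

# Empty frame with one row per year from 2011 through 2019.
df2=pd.DataFrame(index=list(range(2011,2020)), columns=["Number_of_units_since_year"], data=[np.nan]*(2020-2011))

# Running count of units; keep the largest count reached within each year.
df=df.sort_values("Unit_Create_Year").set_index("Unit_Create_Year").expanding().count().reset_index().groupby("Unit_Create_Year").max()

df2.loc[df.index.values]=df

# Carry the last known count forward into years with no new units.
df2=df2.ffill().astype(int).reset_index().rename(columns={"index": "Year"})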

Output:

   Year  Number_of_units_since_year
0  2011                           2
1  2012                           4
2  2013                           4
3  2014                           5
4  2015                           6
5  2016                           6
6  2017                           9
7  2018                           9
8  2019                           9

Cumulative count based on a year column in pandas
