Question

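I have a dataset covering many regions. For each region and year it records the number of completed projects, the number of incomplete projects, and the region's overall inventory.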

Region       Year    Completions    Incomplete    Inventory
New York     1999    100            200           1500
New York     2000    150            100           1650
New York     2001    125            100           1775
                           ....
Oregon       1999    100            200           1500
Oregon       2000    150            100           1650
Oregon       2001    125            100           1775

Given this input, I want to produce a list of region, year, and completions expressed as a percentage of the previous year's inventory:

(Current Year Completions / Previous Year Inventory) * 100

The result set should look like this:

Region    Year    Completions
New York  1999    NaN
New York  2000    10%
New York  2001    7.58%
       .........

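I have sorted by region and year, but a missing year makes the results incorrect. If a year is missing, I would expect either NaN or a value based on the most recent known year (for example, when computing from 2015 completions and 2014 inventory, if 2014 is missing, either return NaN or use the 2013 value).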

import pandas as pd

data = {'Region':['New York', 'New York', 'New York', 'Oregon', 'Oregon', 'Oregon'],
        'Year':[1999,2000,2001,1999,2000,2001],
        'Completions':[100,150,125,100,150,125],
        'Incomplete':[200,100,100,200,100,100],
        'Inventory':[1500,1650,1775,1500,1650,1775]
       }

dfa = pd.DataFrame(data)
dfa = dfa.sort_values(by=['Region','Year'])
# shift(1) takes the previous row's inventory, regardless of region or year gaps
dfa['Completions'] = (dfa['Completions'] / dfa['Inventory'].shift(1) * 100)
dfa['Incomplete'] = (dfa['Incomplete'] / dfa['Inventory'].shift(1) * 100)
resultDf = dfa[['Region','Year', 'Completions', 'Incomplete']]
resultDf.head()

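The shortcoming is that it takes Oregon's 1999 completions and compares them against New York's 2001 inventory. In addition, any missing year badly skews the data.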
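
A minimal sketch of the first problem, assuming the same sample data. The column names prev_plain and prev_grouped are made up for illustration. A plain shift leaks across the region boundary, while a grouped shift does not (though the grouped shift alone still does not handle missing years):

```python
import pandas as pd

data = {
    'Region': ['New York', 'New York', 'New York', 'Oregon', 'Oregon', 'Oregon'],
    'Year': [1999, 2000, 2001, 1999, 2000, 2001],
    'Completions': [100, 150, 125, 100, 150, 125],
    'Inventory': [1500, 1650, 1775, 1500, 1650, 1775],
}
demo = pd.DataFrame(data).sort_values(['Region', 'Year'])

# Plain shift: Oregon 1999 picks up New York 2001's inventory (1775).
demo['prev_plain'] = demo['Inventory'].shift(1)

# Grouped shift: the first row of every region gets NaN instead.
demo['prev_grouped'] = demo.groupby('Region')['Inventory'].shift(1)

print(demo[['Region', 'Year', 'prev_plain', 'prev_grouped']])
```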

Is there a better way to account for this?

Answer 1

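Consider the following setup:

Use groupby to split the data by Region into a dictionary of data frames;
Run reindex on each subset to get consecutive years, e.g. 1999-2019, where missing years receive NaN;
Use assign to compute the required year-over-year columns;
Finally, use concat to put all the subset data frames from the dictionary back together.

The following runs everything except the concatenation inside a dictionary comprehension: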

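# df is assumed to be the original, unmodified input frame, e.g. df = pd.DataFrame(data)
df_dict = {
           k:(d.set_index('Year')
               .reindex(range(1999,2020), axis='index')   # one row per year 1999-2019; missing years become NaN
               .reset_index()
               .assign(Region = k,   # group key; also fills Region on the NaN rows added by reindex
                       Completions_YOY = lambda x: (x['Completions'] / x['Inventory'].shift(1) * 100),
                       Incomplete_YOY = lambda x: (x['Incomplete'] / x['Inventory'].shift(1) * 100)
                       )
              )
           for k, d in df.groupby('Region')
          }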

final_df = pd.concat(df_dict, ignore_index=True)

print(final_df.head(10))
#    Year    Region  Completions  Incomplete  Inventory  Completions_YOY  Incomplete_YOY
# 0  1999  New York        100.0       200.0     1500.0              NaN             NaN
# 1  2000  New York        150.0       100.0     1650.0        10.000000        6.666667
# 2  2001  New York        125.0       100.0     1775.0         7.575758        6.060606
# 3  2002  New York          NaN         NaN        NaN              NaN             NaN
# 4  2003  New York          NaN         NaN        NaN              NaN             NaN
# 5  2004  New York          NaN         NaN        NaN              NaN             NaN
# 6  2005  New York          NaN         NaN        NaN              NaN             NaN
# 7  2006  New York          NaN         NaN        NaN              NaN             NaN
# 8  2007  New York          NaN         NaN        NaN              NaN             NaN
# 9  2008  New York          NaN         NaN        NaN              NaN             NaN
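
As a possible variation (a sketch under assumptions, not part of the setup above), both behaviours asked for in the question can be computed side by side: NaN when the previous year is missing, or a fallback to the most recent known inventory. The sample frame below is hypothetical (New York has no row for 2000), and the column name Completions_YOY_ffill is made up for illustration:

```python
import pandas as pd

# Hypothetical sample: New York has no row for 2000.
df = pd.DataFrame({
    'Region': ['New York', 'New York', 'Oregon', 'Oregon', 'Oregon'],
    'Year': [1999, 2001, 1999, 2000, 2001],
    'Completions': [100, 125, 100, 150, 125],
    'Inventory': [1500, 1775, 1500, 1650, 1775],
})

# Build every (Region, Year) pair so missing years become NaN rows.
years = range(df['Year'].min(), df['Year'].max() + 1)
full_index = pd.MultiIndex.from_product(
    [df['Region'].unique(), years], names=['Region', 'Year'])
full = df.set_index(['Region', 'Year']).reindex(full_index).sort_index()

inv = full.groupby(level='Region')['Inventory']

# Option 1: strict previous year; NaN when that year is missing.
prev_strict = inv.shift(1)
# Option 2: most recent known inventory before the current year.
prev_known = inv.ffill().groupby(level='Region').shift(1)

full['Completions_YOY'] = full['Completions'] / prev_strict * 100
full['Completions_YOY_ffill'] = full['Completions'] / prev_known * 100

# Keep only the years that actually have data.
result = full.dropna(subset=['Completions']).reset_index()
print(result)
```

With this sample, New York 2001 gets NaN under the strict option and 125 / 1500 * 100 under the fallback option, because 2000 is missing and 1999 is the most recent known inventory.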

Calculating year-over-year values with missing years
