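I have a dataset with many regions. For each region and year it records the number of completed projects, the number of incomplete projects, and the overall inventory for that region.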
Region Year Completions Incomplete Inventory
New York 1999 100 200 1500
New York 2000 150 100 1650
New York 2001 125 100 1775
....
Oregon 1999 100 200 1500
Oregon 2000 150 100 1650
Oregon 2001 125 100 1775
Given this input, I want to produce a list of region, year, and completions as a percentage of the previous year's inventory:
(Current Year Completions / Previous Year Inventory) * 100
The result set should look something like this:
Region Year Completions
New York 1999 NaN
New York 2000 10%
New York 2001 7.58%
.........
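As a quick arithmetic check of the expected values above, here is a minimal sketch in plain Python. The two dictionaries and the helper name `pct_of_prev_inventory` are illustrative only; they just restate the New York rows from the sample table.

```python
# New York rows from the sample table, keyed by year
completions = {1999: 100, 2000: 150, 2001: 125}
inventory = {1999: 1500, 2000: 1650, 2001: 1775}

def pct_of_prev_inventory(year):
    """(Current Year Completions / Previous Year Inventory) * 100, NaN if no previous year."""
    prev = inventory.get(year - 1)
    if prev is None:
        return float('nan')
    return completions[year] / prev * 100

for year in sorted(completions):
    print(year, round(pct_of_prev_inventory(year), 2))
```

Because the previous year is looked up by its actual value (`year - 1`) rather than by row position, a missing year gives NaN instead of silently borrowing another row.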
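I have sorted by region and year, but if a year is missing the results come out wrong. When a year is missing, I would expect either NaN or the value from the most recent known year (i.e. when calculating with 2015 completions and 2014 inventory, if 2014 is missing, either return NaN or use the 2013 value).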
import pandas as pd

data = {'Region': ['New York', 'New York', 'New York', 'Oregon', 'Oregon', 'Oregon'],
        'Year': [1999, 2000, 2001, 1999, 2000, 2001],
        'Completions': [100, 150, 125, 100, 150, 125],
        'Incomplete': [200, 100, 100, 200, 100, 100],
        'Inventory': [1500, 1650, 1775, 1500, 1650, 1775]
        }
dfa = pd.DataFrame(data)
dfa = dfa.sort_values(by=['Region', 'Year'])
# shift(1) moves Inventory down one row across the whole frame, not per region
dfa['Completions'] = (dfa['Completions'] / dfa['Inventory'].shift(1) * 100)
dfa['Incomplete'] = (dfa['Incomplete'] / dfa['Inventory'].shift(1) * 100)
resultDf = dfa[['Region', 'Year', 'Completions', 'Incomplete']]
resultDf.head()
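The drawback is that it takes Oregon's 1999 completions and compares them against New York's 2001 inventory. On top of that, if any year is missing, the data gets badly skewed.
Is there a better way to account for this?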
Answer 0: (score: 1)
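Consider the following setup: groupby to split the data frame by Region into a dictionary of data frames; reindex to get consecutive years, e.g. 1999-2019, so that missing years receive NaN; assign to calculate the year-over-year columns; and concat to put all the subset data frames back together. The code below runs everything except the concatenation inside a dictionary comprehension: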
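import pandas as pd

# df is the original, unmodified data frame built from `data` in the question
df = pd.DataFrame(data)

df_dict = {
    k: (d.set_index('Year')
         .reindex(range(1999, 2020), axis='index')   # consecutive years; gaps become NaN rows
         .reset_index()
         .assign(Region=k,                           # group key; also fills Region on the NaN rows
                 Completions_YOY=lambda x: (x['Completions'] / x['Inventory'].shift(1) * 100),
                 Incomplete_YOY=lambda x: (x['Incomplete'] / x['Inventory'].shift(1) * 100)
                 )
        )
    for k, d in df.groupby('Region')
}

# Stack the per-region frames back into one data frame
final_df = pd.concat(df_dict, ignore_index=True)
print(final_df.head(10))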
#    Year    Region  Completions  Incomplete  Inventory  Completions_YOY  Incomplete_YOY
# 0  1999  New York        100.0       200.0     1500.0              NaN             NaN
# 1  2000  New York        150.0       100.0     1650.0        10.000000        6.666667
# 2  2001  New York        125.0       100.0     1775.0         7.575758        6.060606
# 3  2002  New York          NaN         NaN        NaN              NaN             NaN
# 4  2003  New York          NaN         NaN        NaN              NaN             NaN
# 5  2004  New York          NaN         NaN        NaN              NaN             NaN
# 6  2005  New York          NaN         NaN        NaN              NaN             NaN
# 7  2006  New York          NaN         NaN        NaN              NaN             NaN
# 8  2007  New York          NaN         NaN        NaN              NaN             NaN
# 9  2008  New York          NaN         NaN        NaN              NaN             NaN
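The question also asks for the option of falling back to the most recent known year when the previous year is missing. A minimal sketch of one way to cover both cases is shown here. The function name `completion_pct`, the `fill_gaps` flag, and the sample frame with New York's 2000 row removed are all assumptions made for illustration, not part of the original data. It reindexes each region to consecutive years (as above), optionally forward-fills the inventory, and then keeps only the years that were actually present.

```python
import pandas as pd

# Hypothetical sample: New York is missing the year 2000, Oregon is complete
df = pd.DataFrame({
    'Region': ['New York', 'New York', 'Oregon', 'Oregon', 'Oregon'],
    'Year': [1999, 2001, 1999, 2000, 2001],
    'Completions': [100, 125, 100, 150, 125],
    'Inventory': [1500, 1775, 1500, 1650, 1775],
})

def completion_pct(df, fill_gaps=False):
    """Completions as a percentage of the previous year's inventory, per region.

    fill_gaps=False: a missing previous year yields NaN.
    fill_gaps=True:  a missing previous year falls back to the latest known inventory.
    """
    parts = []
    for region, d in df.groupby('Region'):
        d = d.set_index('Year').sort_index()
        # Consecutive years, so shift(1) always means "previous calendar year"
        years = range(int(d.index.min()), int(d.index.max()) + 1)
        inv = d['Inventory'].reindex(years)
        if fill_gaps:
            inv = inv.ffill()
        # Previous-year inventory, restricted to the years present in the data
        prev_inv = inv.shift(1).reindex(d.index)
        pct = d['Completions'] / prev_inv * 100
        parts.append(pd.DataFrame({'Region': region,
                                   'Year': d.index,
                                   'Completions_YOY': pct.to_numpy()}))
    return pd.concat(parts, ignore_index=True)

print(completion_pct(df))                  # New York 2001 is NaN (2000 is missing)
print(completion_pct(df, fill_gaps=True))  # New York 2001 uses the 1999 inventory
```

Unlike the full 1999-2019 reindex, this keeps the output limited to the rows that exist in the input, which may be easier to read when the data has only a few gaps.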