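I have an unbalanced Pandas MultiIndex DataFrame in which each row stores one firm-year observation. The sample period (variable year) runs from 2013 to 2017. The dataset contains the variable event, which is set to 1 if an event occurred in the given year.
Sample dataset: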
#Create dataset
import pandas as pd
df = pd.DataFrame({'id' : [1,1,1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,5],
'year' : [2013,2014,2015,2016,2017,2014,2015,2016,2017,
2016,2017,2013,2014,2015,2014,2015,2016,2017],
'event' : [1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1]})
df.set_index(['id', 'year'], inplace = True)
df.sort_index(inplace = True)
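I would like to create a new column status based on the existing column event, as follows: once the event occurs for the first time in column event, the value of the status column should change from 0 to 1 for that year and all subsequent years.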
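DataFrame with the expected variable status:
         event  status
id year
1  2013      1       1
   2014      0       1
   2015      0       1
   2016      0       1
   2017      0       1
2  2014      0       0
   2015      0       0
   2016      1       1
   2017      0       1
3  2016      1       1
   2017      0       1
4  2013      0       0
   2014      1       1
   2015      0       1
5  2014      0       0
   2015      0       0
   2016      0       0
   2017      1       1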
So far I have not found any useful solution, so any suggestions would be much appreciated. Thanks!
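Answer 0 (score: 3)
We can groupby on the first level of the index (id) and use eq to mark all rows where event equals 1 as True. A cumsum within each group then carries the flag forward to later years, with True counted as 1 and False as 0:
df['status'] = df['event'].eq(1).astype(int).groupby(level=0).cumsum()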
Output
         event  status
id year
1  2013      1       1
   2014      0       1
   2015      0       1
   2016      0       1
   2017      0       1
2  2014      0       0
   2015      0       0
   2016      1       1
   2017      0       1
3  2016      1       1
   2017      0       1
4  2013      0       0
   2014      1       1
   2015      0       1
5  2014      0       0
   2015      0       0
   2016      0       0
   2017      1       1
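One caveat with cumsum: if a firm has more than one event, the running total climbs past 1. Below is a minimal sketch of a variant that stays at 0/1, assuming a cumulative maximum is acceptable in place of the cumulative sum. It swaps in cummax, which is not part of the answer above, and the toy frame is made up purely for illustration.

```python
import pandas as pd

# Hypothetical toy data (not from the question): firm 1 has two events, firm 2 has none.
toy = pd.DataFrame({
    'id':    [1, 1, 1, 1, 2, 2],
    'year':  [2013, 2014, 2015, 2016, 2013, 2014],
    'event': [0, 1, 0, 1, 0, 0],
}).set_index(['id', 'year']).sort_index()

# 1 where an event occurred, 0 otherwise.
flag = toy['event'].eq(1).astype(int)

# Running total per firm: counts events, so it can exceed 1.
toy['status_cumsum'] = flag.groupby(level='id').cumsum()

# Running maximum per firm: switches to 1 at the first event and stays there.
toy['status_cummax'] = flag.groupby(level='id').cummax()

print(toy)
```

On the sample data from the question both variants should agree, since each firm there has exactly one event.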
Answer 1 (score: 0)
The key is to use cumsum within a groupby:
import pandas as pd
df = pd.DataFrame({'id' : [1,1,1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,5],
                   'year' : [2013,2014,2015,2016,2017,2014,2015,2016,2017,
                             2016,2017,2013,2014,2015,2014,2015,2016,2017],
                   'event' : [1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1]})
# Flag event rows, convert the booleans to 0/1, then take a running total per id
df = (df.assign(status = lambda x: x.event.eq(1).mul(1).groupby(x['id']).cumsum())
        .set_index(['id','year']))
Output
         event  status
id year
1  2013      1       1
   2014      0       1
   2015      0       1
   2016      0       1
   2017      0       1
2  2014      0       0
   2015      0       0
   2016      1       1
   2017      0       1
3  2016      1       1
   2017      0       1
4  2013      0       0
   2014      1       1
   2015      0       1
5  2014      0       0
   2015      0       0
   2016      0       0
   2017      1       1
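To see what each link in that chain contributes, here is a small sketch that splits it into named intermediate steps. The toy frame and the variable names (is_event, as_int, steps) are assumptions introduced for illustration only.

```python
import pandas as pd

# Hypothetical toy data with id and year as ordinary columns (no index set yet).
toy = pd.DataFrame({
    'id':    [1, 1, 1, 2, 2],
    'year':  [2013, 2014, 2015, 2014, 2015],
    'event': [0, 1, 0, 0, 1],
})

is_event = toy['event'].eq(1)                  # boolean mask: True on event rows
as_int = is_event.mul(1)                       # True -> 1, False -> 0
status = as_int.groupby(toy['id']).cumsum()    # running total that restarts for each id

# Put the intermediate results side by side for inspection.
steps = pd.DataFrame({'is_event': is_event, 'as_int': as_int, 'status': status})
print(steps)
```

Grouping by the id column rather than by an index level is what lets this run before set_index is called.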
Answer 2 (score: 0)
A basic answer, explained step by step:
import pandas as pd
df = pd.DataFrame({'id' : [1,1,1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,5],
                   'year' : [2013,2014,2015,2016,2017,2014,2015,2016,2017,
                             2016,2017,2013,2014,2015,2014,2015,2016,2017],
                   'event' : [1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1]})
# extract unique IDs in order of first appearance, so the results line up with the rows
# (assumes each ID's rows are contiguous, as in the sample)
ids = df["id"].unique()
# initialize a list to keep the results
list_event_years = []
# loop over the IDs
for firm_id in ids:
    # reset the flag to 0 for each ID
    event_happened = 0
    # loop over the rows of the DF pertaining to the current ID
    for index, row in df[df["id"] == firm_id].iterrows():
        # if the event happened, set the flag to 1
        if row["event"] == 1:
            event_happened = 1
        # add the current flag to the list of results
        list_event_years.append(event_happened)
# add the list of results as a DF column
df["event-happened"] = list_event_years
### OUTPUT
>>> df
    id  year  event  event-happened
0    1  2013      1               1
1    1  2014      0               1
2    1  2015      0               1
3    1  2016      0               1
4    1  2017      0               1
5    2  2014      0               0
6    2  2015      0               0
7    2  2016      1               1
8    2  2017      0               1
9    3  2016      1               1
10   3  2017      0               1
11   4  2013      0               0
12   4  2014      1               1
13   4  2015      0               1
14   5  2014      0               0
15   5  2015      0               0
16   5  2016      0               0
17   5  2017      1               1
If you need them indexed as in the example, do the following:
df.set_index(['id', 'year'], inplace = True)
df.sort_index(inplace = True)
### OUTPUT
>>> df
         event  event-happened
id year
1  2013      1               1
   2014      0               1
   2015      0               1
   2016      0               1
   2017      0               1
2  2014      0               0
   2015      0               0
   2016      1               1
   2017      0               1
3  2016      1               1
   2017      0               1
4  2013      0               0
   2014      1               1
   2015      0               1
5  2014      0               0
   2015      0               0
   2016      0               0
   2017      1               1