Question

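I have an unbalanced Pandas MultiIndex DataFrame where each row stores a firm-year observation. The sample period (variable year) ranges from 2013 to 2017. The dataset includes the variable event, which is set to 1 if an event happened in the given year.

Sample dataset: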

#Create dataset
import pandas as pd

df = pd.DataFrame({'id' : [1,1,1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,5],
                   'year' : [2013,2014,2015,2016,2017,2014,2015,2016,2017,
                             2016,2017,2013,2014,2015,2014,2015,2016,2017],
                   'event' : [1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1]})

df.set_index(['id', 'year'], inplace = True)
df.sort_index(inplace = True)

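I would like to create a new column status based on the existing column event, as follows: once the event occurs for the first time in the column event, the value of the status column should change from 0 to 1 for all subsequent years (including the year in which the event occurs).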

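DataFrame with the expected variable status:

         event  status
id year               
1  2013      1       1
   2014      0       1
   2015      0       1
   2016      0       1
   2017      0       1
2  2014      0       0
   2015      0       0
   2016      1       1
   2017      0       1
3  2016      1       1
   2017      0       1
4  2013      0       0
   2014      1       1
   2015      0       1
5  2014      0       0
   2015      0       0
   2016      0       0
   2017      1       1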

So far I have not found a useful solution, so any advice would be highly appreciated. Thanks!

Answer 1

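We can groupby on the first level of the index (id) and mark every row where event is eq to 1 as True. Taking the cumsum within each group then also converts True to 1 and False to 0: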

df['status'] = df.groupby(level=0)['event'].transform(lambda x: x.eq(1).cumsum())

Output

         event  status
id year               
1  2013      1       1
   2014      0       1
   2015      0       1
   2016      0       1
   2017      0       1
2  2014      0       0
   2015      0       0
   2016      1       1
   2017      0       1
3  2016      1       1
   2017      0       1
4  2013      0       0
   2014      1       1
   2015      0       1
5  2014      0       0
   2015      0       0
   2016      0       0
   2017      1       1
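
One caveat: cumsum counts events, so if a firm could have more than one event, status would go above 1. The sample data has at most one event per id, so this does not show up there. Below is a minimal sketch of a variant that uses cummax instead, assuming status should stay a 0/1 flag. The second event for id 1 in 2016 and the extra count column are hypothetical, added only to illustrate the difference.

```python
import pandas as pd

# Sample data from the question, plus one HYPOTHETICAL extra event
# for id 1 in 2016 to illustrate the repeated-event case.
df = pd.DataFrame({'id': [1,1,1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,5],
                   'year': [2013,2014,2015,2016,2017,2014,2015,2016,2017,
                            2016,2017,2013,2014,2015,2014,2015,2016,2017],
                   'event': [1,0,0,1,0,0,0,1,0,1,0,0,1,0,0,0,0,1]})
df = df.set_index(['id', 'year']).sort_index()

# 1 where an event happened, 0 otherwise
flag = df['event'].eq(1).astype(int)

# cumsum counts the events seen so far within each id
df['count'] = flag.groupby(level='id').cumsum()

# cummax only remembers whether any event has been seen so far within each id
df['status'] = flag.groupby(level='id').cummax()

print(df)
```

For id 1, count keeps increasing after the second event while status stays at 1. On the original sample data both columns would be identical.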

Answer 2

The key is to use cumsum under groupby:

import pandas as pd

df = pd.DataFrame({'id' : [1,1,1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,5],
                   'year' : [2013,2014,2015,2016,2017,2014,2015,2016,2017,
                             2016,2017,2013,2014,2015,2014,2015,2016,2017],
                   'event' : [1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1]})


(df.assign(status = lambda x: x.event.eq(1).mul(1).groupby(x['id']).cumsum())
   .set_index(['id','year']))

Output

        event   status
id  year        
1   2013    1   1
    2014    0   1
    2015    0   1
    2016    0   1
    2017    0   1
2   2014    0   0
    2015    0   0
    2016    1   1
    2017    0   1
3   2016    1   1
    2017    0   1
4   2013    0   0
    2014    1   1
    2015    0   1
5   2014    0   0
    2015    0   0
    2016    0   0
    2017    1   1

Answer 3

A basic answer, with the explanation in the comments:

import pandas as pd

df = pd.DataFrame({'id' : [1,1,1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,5],
                   'year' : [2013,2014,2015,2016,2017,2014,2015,2016,2017,
                             2016,2017,2013,2014,2015,2014,2015,2016,2017],
                   'event' : [1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1]})


# extract unique IDs in order of first appearance, so that the results
# line up with the rows (assumes each ID's rows are contiguous, as here)
ids = df["id"].unique()

# initialize a list to keep the results
list_event_years = []
# open a loop on IDs
for firm_id in ids:
    # set happened to 0
    event_happened = 0
    # open a loop on the rows of the DF pertaining to the current ID
    for index, row in df[df["id"] == firm_id].iterrows():
        # if the event happened, set the variable to 1
        if row["event"] == 1:
            event_happened = 1
        # add the variable to the list of results
        list_event_years.append(event_happened)

# add the list of results as a DF column (named as in the output below)
df["event-year"] = list_event_years

### OUTPUT
>>> df
    id  year  event  event-year
0    1  2013      1           1
1    1  2014      0           1
2    1  2015      0           1
3    1  2016      0           1
4    1  2017      0           1
5    2  2014      0           0
6    2  2015      0           0
7    2  2016      1           1
8    2  2017      0           1
9    3  2016      1           1
10   3  2017      0           1
11   4  2013      0           0
12   4  2014      1           1
13   4  2015      0           1
14   5  2014      0           0
15   5  2015      0           0
16   5  2016      0           0
17   5  2017      1           1

And if you need them indexed as in your example, do this:

df.set_index(['id', 'year'], inplace = True)
df.sort_index(inplace = True)

### OUTPUT
>>> df
         event  event-year
id year                   
1  2013      1           1
   2014      0           1
   2015      0           1
   2016      0           1
   2017      0           1
2  2014      0           0
   2015      0           0
   2016      1           1
   2017      0           1
3  2016      1           1
   2017      0           1
4  2013      0           0
   2014      1           1
   2015      0           1
5  2014      0           0
   2015      0           0
   2016      0           0
   2017      1           1
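
The loop above appends results in iteration order, so it relies on the rows already being grouped by id and sorted by year. As a rough sketch of how the same loop logic could be made independent of row order, the per-firm part can be moved into a small function and applied with groupby(...).transform, which aligns the result back on the row labels. The helper name flag_after_first_event and the shuffling step are assumptions made for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'id': [1,1,1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,5],
                   'year': [2013,2014,2015,2016,2017,2014,2015,2016,2017,
                            2016,2017,2013,2014,2015,2014,2015,2016,2017],
                   'event': [1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1]})


def flag_after_first_event(events):
    """Hypothetical helper: 0 until the first 1 is seen in `events`, 1 from then on."""
    happened = 0
    flags = []
    for value in events:
        if value == 1:
            happened = 1
        flags.append(happened)
    # keep the group's row labels so pandas can align the result
    return pd.Series(flags, index=events.index)


# shuffle the rows to mimic data that is not stored in id/year order
shuffled = df.sample(frac=1, random_state=0)

# sort by year so each firm's rows are visited chronologically, then apply
# the helper per id; the assignment aligns on the row labels
shuffled['event-year'] = (shuffled.sort_values('year')
                                  .groupby('id')['event']
                                  .transform(flag_after_first_event))

result = shuffled.sort_values(['id', 'year'])
print(result)
```

For large datasets the vectorized cumsum approach from the other answers is likely to be faster. This version only keeps the explicit loop logic while dropping the dependence on row order.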

MultiIndex DataFrame: How to create a new column based on values in another column?