Efficiently creating columns via GroupBy operations

Time: 2019-12-25 13:43:26

Tags: python pandas pandas-groupby multi-index

Given a DataFrame like this:

   kind       seen
0  tiger 2019-01-01
1  tiger 2019-01-02
2   bird 2019-01-03
3  whale 2019-01-04
4   bird 2019-01-05
5  tiger 2019-01-06
6   bird 2019-01-07

The goal is to group the DataFrame by animal kind and return the two most recent dates for each kind as column values:

      last_seen   second_last_seen
bird  2019-01-07  2019-01-05
tiger 2019-01-06  2019-01-02
whale 2019-01-04         NaT

My current solution is very inefficient. It goes like this:

1. Create the DataFrame

import pandas as pd
data = {"kind": ["tiger", "tiger", "bird", "whale", "bird", "tiger", "bird"], 
        "seen": pd.date_range('2019-01-01', periods = 7)}
df = pd.DataFrame(data)

DataFrame:

   kind       seen
0  tiger 2019-01-01
1  tiger 2019-01-02
2   bird 2019-01-03
3  whale 2019-01-04
4   bird 2019-01-05
5  tiger 2019-01-06
6   bird 2019-01-07

2. Use groupby to compute the most recent dates

df = df.groupby('kind')['seen'].nlargest(2)

Result (now a Series with a MultiIndex):

kind    
bird   6   2019-01-07
       4   2019-01-05
tiger  5   2019-01-06
       1   2019-01-02
whale  3   2019-01-04

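Here is the problem: the second level of the MultiIndex keeps the original row index of each date as its value.

This means that if I call df.unstack() now, the DataFrame looks like this: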

               1          3          4          5          6
kind                                                        
bird         NaT        NaT 2019-01-05        NaT 2019-01-07
tiger 2019-01-02        NaT        NaT 2019-01-06        NaT
whale        NaT 2019-01-04        NaT        NaT        NaT

The goal is for it to look like this:

      last_seen   second_last_seen
bird  2019-01-07  2019-01-05
tiger 2019-01-06  2019-01-02
whale 2019-01-04         NaT

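3. Transform the DataFrame in a very ugly way

I replace the second level of the MultiIndex with values that let df.unstack() reshape the data into the target DataFrame: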

# Keeping track of the latest animal seen
predecessor_id = None
counter = 1
result = list()

for row in df.index:
    if predecessor_id != row[0]:
        counter = 1
    else:
        counter += 1
    result.append((row[0], counter))
    predecessor_id = row[0]

df.index = pd.MultiIndex.from_tuples(result)

Result:

bird   1   2019-01-07
       2   2019-01-05
tiger  1   2019-01-06
       2   2019-01-02
whale  1   2019-01-04

Calling df.unstack() and renaming the columns then gives us the target DataFrame:

      last_seen   second_last_seen
bird  2019-01-07  2019-01-05
tiger 2019-01-06  2019-01-02
whale 2019-01-04         NaT
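
The question does not show the code for this last step. A minimal sketch of it, assuming the renumbered second index level uses 1 for the most recent date and 2 for the second most recent, might look like the following. The names `top2` and `target` are illustrative and not from the original post.

```python
import pandas as pd

df = pd.DataFrame({
    "kind": ["tiger", "tiger", "bird", "whale", "bird", "tiger", "bird"],
    "seen": pd.date_range("2019-01-01", periods=7),
})

# Steps 2 and 3 from the question: take the two latest dates per kind, then
# renumber the second index level (1 = most recent, 2 = second most recent).
top2 = df.groupby("kind")["seen"].nlargest(2)
result, predecessor_id, counter = [], None, 1
for kind, _ in top2.index:
    counter = 1 if predecessor_id != kind else counter + 1
    result.append((kind, counter))
    predecessor_id = kind
top2.index = pd.MultiIndex.from_tuples(result)

# Final step: move the renumbered level into columns and rename them.
target = top2.unstack().rename(columns={1: "last_seen", 2: "second_last_seen"})
print(target)
```

Because whale has only one sighting, its second_last_seen cell is left as NaT by unstack().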

Needless to say, this solution is overly convoluted and unidiomatic to the core.

Thank you for your time, and happy holidays!

4 Answers:

Answer 0 (score: 1)

s = df.groupby('kind')['seen'].tail(2)
new_df = df.loc[df['seen'].isin(s)].groupby('kind').agg(['last','first'])

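Then we only need to blank out the values where first and last are the same, which indicates that the kind had only one value in the original DataFrame.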

new_df.columns = new_df.columns.droplevel()
new_df.loc[new_df['first'] == new_df['last'],'last'] = pd.NaT
new_df.columns = new_df.columns.map(lambda x : x + '_seen')

       last_seen first_seen
kind                       
bird  2019-01-07 2019-01-05
tiger 2019-01-06 2019-01-02
whale        NaT 2019-01-04
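
The output above puts whale's only date in first_seen, whereas the question's target keeps it in last_seen and leaves the second column empty. A hedged variant that follows the target layout is sketched below. It sorts by seen before calling tail(2), since tail(2) takes the last two rows in row order, and it blanks the 'first' column instead. The names `recent`, `out` and `single` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "kind": ["tiger", "tiger", "bird", "whale", "bird", "tiger", "bird"],
    "seen": pd.date_range("2019-01-01", periods=7),
})

# Sort first so that tail(2) really keeps the two most recent rows per kind.
recent = df.sort_values("seen").groupby("kind").tail(2)

# Within the kept rows, 'last' is the most recent date and 'first' the
# second most recent one.
out = recent.groupby("kind")["seen"].agg(["last", "first"])

# A kind with a single sighting has no second most recent date.
single = recent.groupby("kind")["seen"].size() == 1
out.loc[single, "first"] = pd.NaT

out.columns = ["last_seen", "second_last_seen"]
print(out)
```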

Answer 1 (score: 1)

Here is one approach:

grp=df.groupby('kind')['seen'].nlargest(2).droplevel(1).to_frame()
grp=grp.set_index(grp.groupby(grp.index).cumcount(),append=True).unstack()

grp.columns=['last_seen','second_last_seen']
print(grp)

       last_seen second_last_seen
kind                             
bird  2019-01-07       2019-01-05
tiger 2019-01-06       2019-01-02
whale 2019-01-04              NaT
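
The key step in this answer is cumcount(), which numbers the rows inside each group starting from 0. As a small illustration on the same sample data (the variable names `top2` and `position` are not from the answer), the positions it produces can be inspected like this:

```python
import pandas as pd

df = pd.DataFrame({
    "kind": ["tiger", "tiger", "bird", "whale", "bird", "tiger", "bird"],
    "seen": pd.date_range("2019-01-01", periods=7),
})

# Two latest dates per kind, with the original row labels dropped.
top2 = df.groupby("kind")["seen"].nlargest(2).droplevel(1)

# Position of each date within its kind: 0 = most recent, 1 = second most recent.
position = top2.groupby(level=0).cumcount()
print(position.tolist())
```

These positions replace the original row labels as the second index level, which is why unstack() then yields exactly two columns.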

Answer 2 (score: 1)

You can do the following:

g = df.sort_values('seen').groupby('kind')['seen']
df2 = g.nth(-1).rename('last_seen').to_frame()
df2['second_last_seen'] = g.nth(-2)

The result will be:

       last_seen second_last_seen
kind                             
bird  2019-01-07       2019-01-05
tiger 2019-01-06       2019-01-02
whale 2019-01-04              NaT

If you need more columns, you can use the following solution:

g = df.sort_values('seen').groupby('kind')['seen']
df2 = g.nth(-1).rename('last_seen').to_frame()
for k in range(2,4):
    df2[str(k)+'_last_seen'] = g.nth(-k)

This results in:

       last_seen 2_last_seen 3_last_seen
kind                                    
bird  2019-01-07  2019-01-05  2019-01-03
tiger 2019-01-06  2019-01-02  2019-01-01
whale 2019-01-04         NaT         NaT

UPD: added sorting by the "seen" column, since it is required in the general case. Thanks @aitak
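
One caveat: in pandas 2.0 and later, GroupBy.nth() behaves as a filter and keeps the original row index instead of the group keys, so the nth-based snippets may not give a kind-indexed result there. A sketch that does not rely on nth() and should behave the same across versions is shown below. The parameter `n` and the names `ordered` and `wide` are illustrative assumptions, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "kind": ["tiger", "tiger", "bird", "whale", "bird", "tiger", "bird"],
    "seen": pd.date_range("2019-01-01", periods=7),
})

n = 3  # number of most recent dates to keep per kind (illustrative)

# Newest rows first, then number the rows within each kind: 1 = most recent.
ordered = df.sort_values("seen", ascending=False)
ordered["rank"] = ordered.groupby("kind").cumcount() + 1

# Keep the top n ranks and spread them into one column per rank.
wide = ordered[ordered["rank"] <= n].pivot(index="kind", columns="rank", values="seen")

# Follow the answer's naming: last_seen, 2_last_seen, 3_last_seen, ...
wide.columns = ["last_seen"] + [f"{k}_last_seen" for k in wide.columns[1:]]
print(wide)
```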

Answer 3 (score: 1)

Another solution (if "seen" is of Timestamp dtype):

s = df.groupby("kind")["seen"].agg(lambda t: t.nlargest(2).to_list())

s

kind
bird     [2019-01-07 00:00:00, 2019-01-05 00:00:00]
tiger    [2019-01-06 00:00:00, 2019-01-02 00:00:00]
whale                         [2019-01-04 00:00:00]
Name: seen, dtype: object

pd.DataFrame(s.to_list(), index=s.index).rename(columns={0: "last_seen", 1: "second_last_seen"})

       last_seen second_last_seen
kind                             
bird  2019-01-07       2019-01-05
tiger 2019-01-06       2019-01-02
whale 2019-01-04              NaT