Given a DataFrame like this:
    kind       seen
0  tiger 2019-01-01
1  tiger 2019-01-02
2   bird 2019-01-03
3  whale 2019-01-04
4   bird 2019-01-05
5  tiger 2019-01-06
6   bird 2019-01-07
The goal is to group the DataFrame by kind of animal and put the two most recent dates into separate columns:
       last_seen second_last_seen
bird  2019-01-07       2019-01-05
tiger 2019-01-06       2019-01-02
whale 2019-01-04              NaT
My current solution is very inefficient. It goes like this:
1. Create the DataFrame
import pandas as pd
data = {"kind": ["tiger", "tiger", "bird", "whale", "bird", "tiger", "bird"],
        "seen": pd.date_range('2019-01-01', periods=7)}
df = pd.DataFrame(data)
DataFrame:
    kind       seen
0  tiger 2019-01-01
1  tiger 2019-01-02
2   bird 2019-01-03
3  whale 2019-01-04
4   bird 2019-01-05
5  tiger 2019-01-06
6   bird 2019-01-07
2. Use groupby to compute the most recent dates
df = df.groupby('kind')['seen'].nlargest(2)
DataFrame:
kind
bird   6   2019-01-07
       4   2019-01-05
tiger  5   2019-01-06
       1   2019-01-02
whale  3   2019-01-04
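Here is the problem: the second level of the MultiIndex keeps the original row index of each date as its value.
This means that if I now call df.unstack(), the DataFrame looks like this:
               1          3          4          5          6
kind
bird         NaT        NaT 2019-01-05        NaT 2019-01-07
tiger 2019-01-02        NaT        NaT 2019-01-06        NaT
whale        NaT 2019-01-04        NaT        NaT        NaT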
The goal is for it to look like this:
       last_seen second_last_seen
bird  2019-01-07       2019-01-05
tiger 2019-01-06       2019-01-02
whale 2019-01-04              NaT
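3. Transform the DataFrame in a very ugly way
I change the second level of the MultiIndex so that df.unstack() arranges the values the same way as the target DataFrame: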
# Keeping track of the latest animal seen
predecessor_id = None
counter = 1
result = list()
for row in df.index:
    # Restart the counter whenever a new kind begins
    if predecessor_id != row[0]:
        counter = 1
    else:
        counter += 1
    result.append((row[0], counter))
    predecessor_id = row[0]
# Replace the original row labels with the per-kind counter
df.index = pd.MultiIndex.from_tuples(result)
DataFrame:
bird   1   2019-01-07
       2   2019-01-05
tiger  1   2019-01-06
       2   2019-01-02
whale  1   2019-01-04
Calling df.unstack() and renaming the columns then gives us the target DataFrame:
       last_seen second_last_seen
bird  2019-01-07       2019-01-05
tiger 2019-01-06       2019-01-02
whale 2019-01-04              NaT
Needless to say, this solution is far too hacky and unidiomatic to the core.
Thank you for your time and happy holidays!
Answer 0 (score: 1)
s = df.groupby('kind')['seen'].tail(2)
new_df = df.loc[df['seen'].isin(s)].groupby('kind').agg(['last','first'])
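Then we only need to blank out the values where first and last are identical, which indicates that the group had only one value in the original DataFrame.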
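# Drop the outer 'seen' level so the columns are just 'last' and 'first'
new_df.columns = new_df.columns.droplevel()
# Single-row groups have first == last; set 'last' to NaT for them
new_df.loc[new_df['first'] == new_df['last'], 'last'] = pd.NaT
new_df.columns = new_df.columns.map(lambda x: x + '_seen')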
       last_seen first_seen
kind
bird  2019-01-07 2019-01-05
tiger 2019-01-06 2019-01-02
whale        NaT 2019-01-04
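Note that this output puts the NaT for whale in the last_seen column, whereas the question's target has it under second_last_seen. Also, tail(2) only picks the two latest dates when the rows are already sorted by seen. Below is a minimal sketch of a variant that handles both points, assuming the question's df; the names recent, grouped, out and single_row are illustrative and not part of the answer:

```python
import pandas as pd

df = pd.DataFrame({
    "kind": ["tiger", "tiger", "bird", "whale", "bird", "tiger", "bird"],
    "seen": pd.date_range("2019-01-01", periods=7),
})

# Sort first so that tail(2) really returns the two latest rows per kind.
recent = df.sort_values("seen").groupby("kind").tail(2)

# Named aggregation: within each group, "last" is the latest date and
# "first" is the second latest (or the same date if the group has one row).
grouped = recent.groupby("kind")["seen"]
out = grouped.agg(last_seen="last", second_last_seen="first")

# Groups with a single row have no second-latest date; mark it as missing.
single_row = grouped.size() == 1
out.loc[single_row, "second_last_seen"] = pd.NaT

print(out)
```

Selecting rows with tail(2) directly, rather than matching dates with isin, also avoids picking up rows from another kind that happen to share a date.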
Answer 1 (score: 1)
Here is one way:
grp=df.groupby('kind')['seen'].nlargest(2).droplevel(1).to_frame()
grp=grp.set_index(grp.groupby(grp.index).cumcount(),append=True).unstack()
grp.columns=['last_seen','second_last_seen']
print(grp)
       last_seen second_last_seen
kind
bird  2019-01-07       2019-01-05
tiger 2019-01-06       2019-01-02
whale 2019-01-04              NaT
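The key step here is cumcount, which numbers the rows within each group (0, 1, ...) so that there is a clean per-group position to pivot on instead of the original row labels. As a rough sketch of the same idea generalised to n columns, assuming the question's df (the helper latest_per_group, its parameters, and the latest_1, latest_2 column names are hypothetical, not from the answer):

```python
import pandas as pd

def latest_per_group(df, key="kind", col="seen", n=2):
    """Hypothetical helper: the n largest `col` values per `key`, one per column."""
    # Sort descending so head(n) keeps the n latest rows of each group.
    top = df.sort_values(col, ascending=False).groupby(key).head(n)
    # Position within each group: 0 = latest, 1 = second latest, ...
    top = top.assign(pos=top.groupby(key).cumcount())
    # One row per key, one column per position; missing positions become NaT.
    wide = top.pivot(index=key, columns="pos", values=col)
    wide.columns = [f"latest_{i + 1}" for i in wide.columns]
    return wide

df = pd.DataFrame({
    "kind": ["tiger", "tiger", "bird", "whale", "bird", "tiger", "bird"],
    "seen": pd.date_range("2019-01-01", periods=7),
})
res = latest_per_group(df, n=2)
print(res)
```

If no group has n rows, the result simply has fewer columns, since pivot only creates columns for positions that actually occur.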
Answer 2 (score: 1)
You can do the following:
g = df.sort_values('seen').groupby('kind')['seen']
df2 = g.nth(-1).rename('last_seen').to_frame()
df2['second_last_seen'] = g.nth(-2)
The result will be:
       last_seen second_last_seen
kind
bird  2019-01-07       2019-01-05
tiger 2019-01-06       2019-01-02
whale 2019-01-04              NaT
If you need more columns, you can use the following solution:
g = df.sort_values('seen').groupby('kind')['seen']
df2 = g.nth(-1).rename('last_seen').to_frame()
for k in range(2, 4):
    df2[str(k) + '_last_seen'] = g.nth(-k)
This results in:
       last_seen 2_last_seen 3_last_seen
kind
bird  2019-01-07  2019-01-05  2019-01-03
tiger 2019-01-06  2019-01-02  2019-01-01
whale 2019-01-04         NaT         NaT
UPD: added sorting by the 'seen' column, since it is required in the general case. Thanks @aitak
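One caveat: since pandas 2.0, groupby(...).nth() behaves as a filter and keeps the original row labels rather than the group keys, so on newer versions the nth results in this answer may no longer be indexed by kind. The following is a sketch of the same result without nth, using cumcount(ascending=False) to number rows from the end of each group. It assumes the question's df; the names ordered, pos_from_end and col are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "kind": ["tiger", "tiger", "bird", "whale", "bird", "tiger", "bird"],
    "seen": pd.date_range("2019-01-01", periods=7),
})

ordered = df.sort_values("seen")
# 0 = latest row of its kind, 1 = second latest, and so on.
pos_from_end = ordered.groupby("kind").cumcount(ascending=False)

# Start from an empty frame indexed by kind, then add one column per position.
df2 = pd.DataFrame(index=pd.Index(sorted(df["kind"].unique()), name="kind"))
for k in range(1, 4):
    col = "last_seen" if k == 1 else f"{k}_last_seen"
    # Each kind appears at most once per position, so the index is unique;
    # kinds without a k-th latest row are filled with NaT on assignment.
    df2[col] = ordered.loc[pos_from_end == k - 1].set_index("kind")["seen"]

print(df2)
```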
Answer 3 (score: 1)
Another solution (if "seen" has a Timestamp dtype):
s=df.groupby("kind")["seen"].agg(lambda t: t.nlargest(2).to_list())
s
kind
bird     [2019-01-07 00:00:00, 2019-01-05 00:00:00]
tiger    [2019-01-06 00:00:00, 2019-01-02 00:00:00]
whale                         [2019-01-04 00:00:00]
Name: seen, dtype: object
pd.DataFrame( s.to_list(),index=s.index).rename(columns={0:"last_seen",1:"second_last_seen"})
       last_seen second_last_seen
kind
bird  2019-01-07       2019-01-05
tiger 2019-01-06       2019-01-02
whale 2019-01-04              NaT
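The last step works because the DataFrame constructor pads shorter rows with missing values when given lists of unequal length, which is where the missing entry for whale comes from. A small isolated sketch of just that behaviour, with hand-written rows standing in for s.to_list() (the names rows and padded are illustrative):

```python
import pandas as pd

# Ragged input: the second row has only one date.
rows = [
    [pd.Timestamp("2019-01-07"), pd.Timestamp("2019-01-05")],
    [pd.Timestamp("2019-01-04")],
]

# The shorter row is padded with a missing value in column 1.
padded = pd.DataFrame(rows, index=["bird", "whale"])
padded = padded.rename(columns={0: "last_seen", 1: "second_last_seen"})

print(padded)
```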