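I have this table, and I want to merge consecutive rows where
end_date[-1] = start_date[0] AND trained[-1] = trained[0]
Optional: I would also like to keep the values of re, id and name from the row with the largest diff.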
use  trained  start       end         diff   re  id  name
a    FALSE    01/12/2010  03/01/2018  2,590  0   4   25
a    TRUE     03/01/2018  08/02/2019  401    0   4   25
a    TRUE     08/02/2019  09/02/2019  1      0   4   25
a    TRUE     09/02/2019  31/12/2019  325    1   4   25
b    FALSE    01/08/2016  15/05/2018  652    0   5   8
c    FALSE    01/07/2019  06/08/2019  36     0   4   4
c    TRUE     06/08/2019  18/05/2020  286    0   4   4
c    TRUE     18/05/2020  19/05/2020  1      0   4   4
c    TRUE     19/05/2020  01/09/2020  105    0   4   4
c    TRUE     01/09/2020  31/12/2019  (245)  1   4   15
Goal:
use  trained  start       end         diff   re  id  name
a    FALSE    01/12/2010  03/01/2018  2,590  0   4   25
a    TRUE     03/01/2018  31/12/2019  727    0   4   25
b    FALSE    01/08/2016  15/05/2018  652    0   5   8
c    FALSE    01/07/2019  06/08/2019  36     0   4   4
c    TRUE     06/08/2019  31/12/2019  147    0   4   4
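For anyone who wants to reproduce this, here is a minimal sketch that builds the table as a pandas DataFrame. It rests on a few assumptions that are not stated in the question: the frame is called `df`, the dates are in dd/mm/yyyy order, and a value in parentheses such as `(245)` is an accounting-style negative number. The helper `parse_diff` and the frame `dates` are names made up for this illustration.

```python
import pandas as pd

columns = ["use", "trained", "start", "end", "diff", "re", "id", "name"]
rows = [
    ("a", False, "01/12/2010", "03/01/2018", "2,590", 0, 4, 25),
    ("a", True,  "03/01/2018", "08/02/2019", "401",   0, 4, 25),
    ("a", True,  "08/02/2019", "09/02/2019", "1",     0, 4, 25),
    ("a", True,  "09/02/2019", "31/12/2019", "325",   1, 4, 25),
    ("b", False, "01/08/2016", "15/05/2018", "652",   0, 5, 8),
    ("c", False, "01/07/2019", "06/08/2019", "36",    0, 4, 4),
    ("c", True,  "06/08/2019", "18/05/2020", "286",   0, 4, 4),
    ("c", True,  "18/05/2020", "19/05/2020", "1",     0, 4, 4),
    ("c", True,  "19/05/2020", "01/09/2020", "105",   0, 4, 4),
    ("c", True,  "01/09/2020", "31/12/2019", "(245)", 1, 4, 15),
]
df = pd.DataFrame(rows, columns=columns)


def parse_diff(text):
    """Turn '2,590' into 2590 and '(245)' into -245 (assumed convention)."""
    text = text.replace(",", "")
    if text.startswith("(") and text.endswith(")"):
        return -int(text[1:-1])
    return int(text)


# Make 'diff' numeric so that sums and maxima behave arithmetically.
df["diff"] = df["diff"].map(parse_diff)

# Optional: parse the dates with an explicit day-first format, kept in a
# separate frame here so the original strings stay untouched.
dates = pd.DataFrame(
    {col: pd.to_datetime(df[col], format="%d/%m/%Y") for col in ("start", "end")}
)
```

Giving an explicit `format` avoids a date such as 01/12/2010 being read as 12 January instead of 1 December.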
Answer 0 (score: 1)
Based on your logic, we can use cumsum() on the negated condition to identify the blocks, and then use groupby:
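import pandas as pd

# 'diff' is read as text ("2,590", "(245)"); convert it to numbers first,
# otherwise 'sum' concatenates the strings instead of adding them
df['diff'] = pd.to_numeric(df['diff'].astype(str)
                           .str.replace(',', '', regex=False)
                           .str.replace(r'\((.+)\)', r'-\1', regex=True))
# a new block starts whenever trained changes or start != previous end
blocks = (df['trained'].ne(df['trained'].shift())
          | df['start'].ne(df['end'].shift())
         ).cumsum()
df.groupby([blocks, 'use']).agg({  # change the functions to fit your need
    'trained': 'first',
    'start': 'first',
    'end': 'last',
    'diff': 'sum',
    're': 'min',
    'id': 'first',
    'name': 'first'
}).reset_index('use')
Output:
  use  trained       start         end  diff  re  id  name
1   a    False  01/12/2010  03/01/2018  2590   0   4    25
2   a     True  03/01/2018  31/12/2019   727   0   4    25
3   b    False  01/08/2016  15/05/2018   652   0   5     8
4   c    False  01/07/2019  06/08/2019    36   0   4     4
5   c     True  06/08/2019  31/12/2019   147   0   4     4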
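The optional part of the question (keep re, id and name from the row with the largest diff) could be handled along the following lines. This is only a sketch under some assumptions: diff is already numeric, a change of `use` also starts a new block, and the names `block`, `spans`, `top` and `result` are invented for this example.

```python
import pandas as pd

columns = ["use", "trained", "start", "end", "diff", "re", "id", "name"]
rows = [
    ("a", False, "01/12/2010", "03/01/2018", 2590, 0, 4, 25),
    ("a", True,  "03/01/2018", "08/02/2019", 401,  0, 4, 25),
    ("a", True,  "08/02/2019", "09/02/2019", 1,    0, 4, 25),
    ("a", True,  "09/02/2019", "31/12/2019", 325,  1, 4, 25),
    ("b", False, "01/08/2016", "15/05/2018", 652,  0, 5, 8),
    ("c", False, "01/07/2019", "06/08/2019", 36,   0, 4, 4),
    ("c", True,  "06/08/2019", "18/05/2020", 286,  0, 4, 4),
    ("c", True,  "18/05/2020", "19/05/2020", 1,    0, 4, 4),
    ("c", True,  "19/05/2020", "01/09/2020", 105,  0, 4, 4),
    ("c", True,  "01/09/2020", "31/12/2019", -245, 1, 4, 15),
]
df = pd.DataFrame(rows, columns=columns)

# A row opens a new block when use or trained changes, or when its start
# does not match the previous row's end.
new_block = (df["use"].ne(df["use"].shift())
             | df["trained"].ne(df["trained"].shift())
             | df["start"].ne(df["end"].shift()))
block = new_block.cumsum().rename("block")

# One row per block: first start, last end, summed diff.
spans = df.groupby(block).agg(
    use=("use", "first"),
    trained=("trained", "first"),
    start=("start", "first"),
    end=("end", "last"),
    diff=("diff", "sum"),
)

# re, id and name come from the row with the largest diff in each block.
top = df.loc[df.groupby(block)["diff"].idxmax(), ["re", "id", "name"]]
top.index = spans.index  # both are in block order, so this lines them up

result = spans.join(top).reset_index(drop=True)
print(result)
```

With the sample data this gives five rows, and the summed diff for the two merged TRUE blocks comes out as 727 and 147, matching the goal table.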
Answer 1 (score: 1)
It sounds like you just want the first value of "start" and the last value of "end" in a groupby.
Assuming your DataFrame is called df:
grouped = df.groupby(['use', 'trained'], as_index=False).agg({
    'start': 'first',
    'end': 'last'})
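You can use groupby again to get the index of the row with the largest "diff" in each group of "use" and "trained" (this needs "diff" to be numeric):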
max_idx_values = df.groupby(['use', 'trained'])['diff'].idxmax().values
Now you can get the values of the fields 're', 'id' and 'name':
re_id_name_df = df.loc[df.index.isin(max_idx_values),
                       ['use', 'trained', 're', 'id', 'name']]
Finally, you can merge the two results into a single DataFrame:
final = grouped.merge(re_id_name_df, on=['use', 'trained'])
All the code in one block:
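# first start and last end per (use, trained) group
grouped = df.groupby(['use', 'trained'], as_index=False).agg({
    'start': 'first',
    'end': 'last'})
# row labels of the largest diff in each group
max_idx_values = df.groupby(['use', 'trained'])['diff'].idxmax().values
# re, id and name taken from those rows
re_id_name_df = df.loc[df.index.isin(max_idx_values),
                       ['use', 'trained', 're', 'id', 'name']]
# combine both results on the grouping keys
final = grouped.merge(re_id_name_df, on=['use', 'trained'])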
print(final)
  use  trained      start        end  re  id  name
0   a    False 2010-01-12 2018-03-01   0   4    25
1   a     True 2018-03-01 2019-12-31   0   4    25
2   b    False 2016-01-08 2018-05-15   0   5     8
3   c    False 2019-01-07 2019-06-08   0   4     4
4   c     True 2019-06-08 2019-12-31   0   4     4
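One caveat worth checking: grouping by 'use' and 'trained' alone gives the expected result for the sample data because each (use, trained) pair forms a single contiguous run. If the same pair can occur in separate runs, those runs would be fused. The sketch below uses made-up data (a hypothetical user "x" whose trained flag goes True, False, True) to compare this with grouping by a block id built from the question's condition; `by_key`, `block` and `by_block` are illustrative names.

```python
import pandas as pd

# Hypothetical data: the True runs are separated by a False period.
df = pd.DataFrame({
    "use": ["x", "x", "x", "x"],
    "trained": [True, False, True, True],
    "start": ["01/01/2020", "01/02/2020", "01/03/2020", "01/04/2020"],
    "end": ["01/02/2020", "01/03/2020", "01/04/2020", "01/05/2020"],
})

# Grouping by the keys only: both True runs land in the same group.
by_key = df.groupby(["use", "trained"], as_index=False).agg(
    start=("start", "first"),
    end=("end", "last"),
)

# Grouping by block id: a new block starts when trained changes or when
# start does not match the previous end.
block = (df["trained"].ne(df["trained"].shift())
         | df["start"].ne(df["end"].shift())).cumsum()
by_block = df.groupby([block, "use"]).agg(
    trained=("trained", "first"),
    start=("start", "first"),
    end=("end", "last"),
).reset_index("use")

print(by_key)    # 2 rows: the True group spans 01/01/2020 to 01/05/2020
print(by_block)  # 3 rows: True, False, True kept as separate periods
```

Which behaviour is right depends on the data; if the rows must be adjacent to be merged, as the condition in the question suggests, the block-based grouping is the safer choice.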