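I am trying to pull data from a Google Spreadsheet that is laid out like a calendar, so that I can reformat it for bulk upload into the information management system we use at work. The final CSV must follow a very specific format, and I am one step away from the final product.
My current DataFrame looks like this: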
description event_type start_date end_date
Training *Required 6/06/2020
New Staff on duty *Required 6/12/2020
Orientation *Required 6/12/2020
Group 1 Closed Session *Required 6/12/2020
Group 1 Closed Session *Required 6/13/2020
Group 1 Closed Session *Required 6/14/2020
Group 1 Closed Session *Required 6/15/2020
Group 1 Closed Session *Required 6/16/2020
All Staff on duty *Required 6/19/2020
Group 1 Closed Session *Required 6/19/2020
Group 1 Closed Session *Required 6/20/2020
Group 1 Closed Session *Required 6/21/2020
Group 1 Closed Session *Required 6/22/2020
Consumer outreach orientation *Required 6/25/2020
Some event on just another day *Required 6/25/2020
All Staff Meeting *Required 6/28/2020
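(The above is only the relevant part of the full dataset. I also changed the content of the data, so apologies for the less-than-great descriptions.)
Instead of listing "Group 1 Closed Session" once per day over several consecutive days, I need those dates collapsed into a single row, with the first day in the start_date column and the last day in the end_date column. I need this for each run of "Group 1 Closed Session", because the two runs span two different sets of dates.
This example shows what I want to achieve: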
description event_type start_date end_date
Training *Required 6/06/2020
New Staff on duty *Required 6/12/2020
Orientation *Required 6/12/2020
Group 1 Closed Session *Required 6/12/2020 6/16/2020
All Staff on duty *Required 6/19/2020
Group 1 Closed Session *Required 6/19/2020 6/22/2020
Consumer outreach orientation *Required 6/25/2020
Some event on just another day *Required 6/25/2020
All Staff Meeting *Required 6/28/2020
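Also, not every run of consecutively listed events will have the same description, so I am hoping for a solution that does not depend on the description text.
Any ideas or leads? Thanks for your help.
Answer 0 (score: 0)
Try: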
df.groupby((df['description'] != df['description'].shift()).cumsum()).first()
Output:
description event_type start_date end_date
description
1 Training *Required 6/06/2020
2 New Staff on duty *Required 6/12/2020
3 Orientation *Required 6/12/2020
4 Group 1 Closed Session *Required 6/12/2020
5 All Staff on duty *Required 6/19/2020
6 Group 1 Closed Session *Required 6/19/2020
7 Consumer outreach orientation *Required 6/25/2020
8 Some event on just another day *Required 6/25/2020
9 All Staff Meeting *Required 6/28/2020
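To see why this grouping key separates the two "Group 1 Closed Session" runs, it may help to look at the key on its own. The sketch below uses a made-up Series (the values "A", "B", "C" are placeholders, not from the question): comparing each value with the previous one marks the start of every run, and the cumulative sum turns those marks into a run id.

```python
import pandas as pd

# Hypothetical toy data: "A" appears in two separate runs
descriptions = pd.Series(["A", "B", "B", "A", "A", "C"], name="description")

# True wherever a value differs from the one directly above it
starts_new_run = descriptions != descriptions.shift()

# Cumulative sum of the booleans gives one id per consecutive run
run_id = starts_new_run.cumsum()

print(run_id.tolist())  # [1, 2, 2, 3, 3, 4]
```

Because the id only changes when the description changes from one row to the next, the same description in two non-adjacent runs ends up in two different groups. Note that `first()` alone keeps only the first row of each run, so the last date still has to be picked up separately.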
Answer 1 (score: 0)
You could use the same groupby as Scott Boston to also get the last row of each group, then join it back to get both the start date and the end date:
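# Assumes the date column is named 'startdate', as in the output below
# Give each run of consecutive identical descriptions its own group id
g = df.groupby((df['description'] != df['description'].shift()).cumsum())
first_df = g.first()
first_df.index = first_df.index.set_names(['id'])
# Dict-style renaming in SeriesGroupBy.agg was removed in pandas 1.0, so take last() and rename
last_df = g['startdate'].last().rename('end date')
last_df.index = last_df.index.set_names(['id'])
first_df.merge(last_df, left_index=True, right_index=True)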
description event_type startdate end date
id
1 Training *Required 2020-06-06 2020-06-06
2 New Staff on duty *Required 2020-06-12 2020-06-12
3 Orientation *Required 2020-06-12 2020-06-12
4 Group 1 Closed Session *Required 2020-06-12 2020-06-16
5 All Staff on duty *Required 2020-06-19 2020-06-19
6 Group 1 Closed Session *Required 2020-06-19 2020-06-22
7 Consumer outreach orientation *Required 2020-06-25 2020-06-25
8 Some event on just another day *Required 2020-06-25 2020-06-25
9 All Staff Meeting *Required 2020-06-28 2020-06-28
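As a possible alternative to merging two frames, the same result can probably be reached in one pass with named aggregation (available in pandas 0.25 and later). This is only a sketch under a few assumptions: the column names follow the question (`start_date`, `end_date`), the sample rows are a shortened version of the question's data, the rows are already sorted by date, and gaps between days inside a run are not checked. The name `run_id` is introduced here for illustration.

```python
import pandas as pd

# Shortened sample modeled on the question's data
df = pd.DataFrame({
    "description": [
        "Training",
        "Group 1 Closed Session", "Group 1 Closed Session", "Group 1 Closed Session",
        "All Staff on duty",
        "Group 1 Closed Session", "Group 1 Closed Session",
    ],
    "event_type": ["*Required"] * 7,
    "start_date": [
        "6/06/2020",
        "6/12/2020", "6/13/2020", "6/14/2020",
        "6/19/2020",
        "6/19/2020", "6/20/2020",
    ],
})

# One id per run of consecutive identical descriptions
run_id = (df["description"] != df["description"].shift()).cumsum().rename("run_id")

# First row supplies description, type and start; last row supplies the end
result = (
    df.groupby(run_id)
      .agg(
          description=("description", "first"),
          event_type=("event_type", "first"),
          start_date=("start_date", "first"),
          end_date=("start_date", "last"),
      )
      .reset_index(drop=True)
)

# Leave end_date blank for single-day events, as in the desired output
result.loc[result["end_date"] == result["start_date"], "end_date"] = ""

print(result)
```

The last step matches the layout requested in the question, where only multi-day events carry an end date; drop it if the upload format expects the end date to be filled in for every row.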