Given the following DataFrame:
person | 2018-01 | 2018-02 | 2018-03 | 2018-04 | 2018-05 | 2018-06 | 2018-07 |
---|---|---|---|---|---|---|---|
p1 |  | y | y |  | y | y | y |
p2 | y | y |  | y | y |  |  |
I want to return the start date and end date of each run of consecutive "y" months, as follows:
person | start date | end date |
---|---|---|
p1 | 20180201 | 20180331 |
p1 | 20180501 | 20180731 |
p2 | 20180101 | 20180228 |
p2 | 20180401 | 20180531 |
Answer 0 (score: 1)
Assuming you are loading the data from Excel:
    import pandas as pd
    from pandas.tseries.offsets import MonthEnd

    # Input data prep: transpose so that each person becomes a column
    data = pd.read_excel('data.xlsx')
    data = data.T
    data.reset_index(inplace=True)

    # Setting the proper header; after the transpose, the "person"
    # column holds the month labels and each other column is one person
    new_header = data.iloc[0]
    data = data[1:]
    data.columns = new_header

    results = []  # One frame per person, combined into the desired table below
    for column in data.columns[1:]:  # per person iteration
        df_temp = data[["person", column]].copy()
        # Easy to work with 1 and 0 for consecutives with cumsum
        df_temp[column] = df_temp[column].eq("y").astype(int)
        # The counter increases every time the value changes, so each run gets its own id
        df_temp['consecutive'] = (df_temp[column].diff(1) != 0).cumsum()
        df_temp = df_temp[df_temp[column] > 0]
        df_temp = pd.DataFrame({
            'person': column,
            'start_date': df_temp.groupby('consecutive')["person"].first(),
            'end_date': df_temp.groupby('consecutive')["person"].last()
        }).reset_index(drop=True)
        results.append(df_temp)
    # DataFrame.append was removed in pandas 2.0, so combine with concat
    df_result = pd.concat(results, ignore_index=True)

    # First and last day of month
    df_result['start_date'] = pd.to_datetime(df_result['start_date'])
    df_result['end_date'] = pd.to_datetime(df_result['end_date']) + MonthEnd(1)
    print(df_result)
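For anyone who wants to try the idea without an Excel file, here is a minimal self-contained sketch of the same run-detection trick. It is not the code above: it reshapes the table with `melt` instead of transposing it, and the names `wide` and `find_runs` are made up for illustration. The sample data is an assumption, rebuilt by hand so that it matches the expected output in the question.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Sample data rebuilt by hand from the question (an assumption, not the real file)
wide = pd.DataFrame(
    {
        "person": ["p1", "p2"],
        "2018-01": [None, "y"],
        "2018-02": ["y", "y"],
        "2018-03": ["y", None],
        "2018-04": [None, "y"],
        "2018-05": ["y", "y"],
        "2018-06": ["y", None],
        "2018-07": ["y", None],
    }
)


def find_runs(wide: pd.DataFrame) -> pd.DataFrame:
    """Return one row per run of consecutive 'y' months for each person."""
    # Long format: one row per (person, month)
    long = wide.melt(id_vars="person", var_name="month", value_name="flag")
    long["month"] = pd.to_datetime(long["month"], format="%Y-%m")
    long = long.sort_values(["person", "month"]).reset_index(drop=True)
    long["active"] = long["flag"].eq("y").astype(int)

    # A new run starts wherever the flag flips or the person changes
    flag_changed = long["active"].diff().ne(0)
    person_changed = long["person"].ne(long["person"].shift())
    long["run"] = (flag_changed | person_changed).cumsum()

    runs = (
        long[long["active"] == 1]
        .groupby(["person", "run"], as_index=False)
        .agg(start_date=("month", "min"), end_date=("month", "max"))
        .drop(columns="run")
    )
    # Month labels parse to the first of the month; roll the end forward to month end
    runs["end_date"] = runs["end_date"] + MonthEnd(0)
    # Format as YYYYMMDD strings, as in the question's expected output
    runs["start_date"] = runs["start_date"].dt.strftime("%Y%m%d")
    runs["end_date"] = runs["end_date"].dt.strftime("%Y%m%d")
    return runs


result = find_runs(wide)
print(result)
```

With the sample data this should give four rows, two per person. Unlike the loop above, it handles all people in one pass, which may be more convenient when there are many rows.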