"Start" and "end" dates based on the values in each row

Time: 2019-07-08 10:19:14

Tags: python pandas

I have a sample of the input data, which can be found here.

Input: [image not preserved]

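I need to add 2 columns based on the data in each row, "start date" and "end date":

  • Start date: the first month for which all previous cells are empty, written as yyyymm01
  • End date: the last month after which all subsequent cells are empty:
  • If no subsequent cells are empty (the row runs to the last column), use a "lifetime" date: "99991231"
  • Otherwise: yyyymm30, 31 or 28 (depending on the month)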
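The rules above can be sketched on a small made-up frame. This is only an illustration under assumptions: the names `toy`, `month_cols` and `start_end` are hypothetical, the month columns are assumed to be strings labelled yyyymm, and every row is assumed to have at least one non-empty cell.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: month columns are labelled yyyymm, NaN means "empty".
toy = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "201801": [np.nan, 5.0, 7.0],
        "201802": [3.0, 6.0, 8.0],
        "201803": [4.0, np.nan, 9.0],
    }
)
month_cols = ["201801", "201802", "201803"]


def start_end(row):
    """Return the start and end date strings for one row."""
    filled = row[month_cols].dropna()          # assumes at least one value
    first, last = filled.index[0], filled.index[-1]
    start = first + "01"                       # first day of the first filled month
    if last == month_cols[-1]:
        end = "99991231"                       # still populated in the last column
    else:
        # yyyymm parses to the 1st of the month; MonthEnd(1) moves to its last day
        month_end = pd.to_datetime(last, format="%Y%m") + pd.offsets.MonthEnd(1)
        end = month_end.strftime("%Y%m%d")
    return pd.Series({"start_date": start, "end_date": end})


result = toy.join(toy.apply(start_end, axis=1))
print(result[["id", "start_date", "end_date"]])
```

A row-wise `apply` is slow on large frames; it is used here only because it mirrors the wording of the rules one to one.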

Output example:

[image not preserved]

Any help would be appreciated :) Thanks

2 answers:

Answer 0: (score: 3)

Use pd.melt() to reshape the data to long form, then sort it by id and date:

import pandas as pd
import numpy as np
from pandas.tseries.offsets import MonthEnd

df = pd.read_excel("input.xlsx")
# Label of the last month column; a row still populated there is open-ended
max_date = df.columns[-1]

# Wide to long: one row per (id, region, month), then drop the empty cells
res = pd.melt(df, id_vars=['id', 'region'], value_vars=df.columns[2:])
res.dropna(subset=['value'], inplace=True)

res.sort_values(by=['id', 'variable'], ascending=[True, True], inplace=True)

# After sorting, the first and last rows per id are the first and last filled months
minimum_date = res.drop_duplicates(subset=['id'], keep='first')
maximum_date = res.drop_duplicates(subset=['id'], keep='last')

minimum_date = minimum_date.rename(columns={'variable': 'start_date'})
maximum_date = maximum_date.rename(columns={'variable': 'end_date'})

df = pd.merge(df, minimum_date[['id', 'start_date']], on=['id'], how='left')
df = pd.merge(df, maximum_date[['id', 'end_date']], on=['id'], how='left')

# Rows that run to the last column get the "lifetime" end date
df['end_date'] = np.where(df['end_date']==max_date,
                          "99991231",df['end_date'])

# Parse yyyymm and move to the last day of that month. This also puts
# start_date on the month end; drop "+MonthEnd(1)" there to get yyyymm01.
df['start_date'] = (pd.to_datetime(df['start_date'],format="%Y%m",errors='coerce') +MonthEnd(1)).astype(str)
df['end_date'] = (pd.to_datetime(df['end_date'],format="%Y%m",errors='coerce') +MonthEnd(1)).astype(str)

# "99991231" does not match the %Y%m format, so it became NaT above; restore it
df['end_date'] = np.where(df['end_date']=='NaT',
                          "99991231",df['end_date'])
print(df)

      id  region  201801  201802  ...  201905  201906  start_date    end_date
0  100001     628     NaN     NaN  ...    26.0    23.0  2018-09-30    99991231
1  100002    1149    27.0    24.0  ...    26.0    24.0  2018-01-31    99991231
2  100003    1290    26.0    26.0  ...    27.0    25.0  2018-01-31    99991231
3  100004     955    25.0    26.0  ...     NaN     NaN  2018-01-31  2018-12-31
4  100005    1397    15.0    25.0  ...     NaN     NaN  2018-01-31  2018-11-30
5  100006    1397    15.0    25.0  ...     NaN     NaN  2018-01-31  2019-02-28
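To see what the reshape step does without the Excel file, here is a minimal sketch on made-up data. The frame `wide` and the names `long_df` and `span` are hypothetical, and `groupby` with `min`/`max` is swapped in for the sort plus `drop_duplicates` of the answer; on yyyymm string labels it should pick the same first and last months.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for input.xlsx
wide = pd.DataFrame(
    {
        "id": [1, 2],
        "region": [10, 20],
        "201801": [np.nan, 5.0],
        "201802": [3.0, 6.0],
        "201803": [4.0, np.nan],
    }
)

# Wide to long: every month column becomes a row, its label lands in 'variable'
long_df = pd.melt(wide, id_vars=["id", "region"]).dropna(subset=["value"])

# First and last populated month per id (yyyymm strings sort chronologically)
span = long_df.groupby("id")["variable"].agg(start="min", end="max").reset_index()
print(span)
```

The resulting `span` frame can then be merged back onto the wide frame on `id`, as the answer does with its two helper frames.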

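Answer 1: (score: 2)

The idea is to move the columns that are not datetime-like into a MultiIndex with DataFrame.set_index, then convert the remaining column labels to datetimes: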

df = pd.read_excel('input.xlsx')

df = df.set_index(['id','region'])
df.columns = pd.to_datetime(df.columns, format='%Y%m')
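As a small illustration of why this conversion helps, the sketch below applies the same two steps to a made-up one-row frame (the name `toy` and its values are hypothetical). Once the labels form a DatetimeIndex, attributes such as `.month` become available for filtering columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with yyyymm string labels
toy = pd.DataFrame(
    {"id": [1], "region": [10], "201801": [np.nan], "201802": [3.0]}
)

# Park the non-date columns in the index so only month columns remain
toy = toy.set_index(["id", "region"])
toy.columns = pd.to_datetime(toy.columns, format="%Y%m")

print(toy.columns)         # datetime labels, each on the 1st of its month
print(toy.columns.month)   # month numbers, usable as a column filter
```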

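Then create the new columns with DataFrame.assign. For begin, filter the January columns, test for non-missing values, take the first match with DataFrame.idxmax, and convert it to a string with Series.dt.strftime. For end, first reverse the column order by indexing with ::-1 to get the last non-missing value, move it to the month end, and use Series.where to set 99991231 when the last column is not missing: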

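# First January column holding a value, formatted as yyyymmdd
begin = df.loc[:, df.columns.month == 1].notna().idxmax(axis=1).dt.strftime('%Y%m%d')
# Reversed columns: the first match is the last filled month; shift it to month end
end1 = df.iloc[:, ::-1].notna().idxmax(axis=1) + pd.offsets.MonthEnd()

# Keep the month end only where the last column is empty, else use 99991231
end = end1.dt.strftime('%Y%m%d').where(df.iloc[:, -1].isna(), '99991231')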

df.columns = df.columns.strftime('%Y%m')
df = df.assign(date_begin = begin, date_end =  end).reset_index()
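The reversed idxmax trick is easier to follow on a tiny example. The sketch below uses a made-up two-row frame (`toy`, `first_filled`, `last_filled` and `month_end` are hypothetical names) and assumes every row has at least one value, since idxmax on an all-False row would simply return the first column.

```python
import numpy as np
import pandas as pd

cols = pd.to_datetime(["201801", "201802", "201803"], format="%Y%m")
toy = pd.DataFrame(
    [[np.nan, 3.0, 4.0],     # starts in February, runs to the last column
     [5.0, 6.0, np.nan]],    # starts in January, stops after February
    columns=cols,
)

# idxmax on a boolean frame returns the label of the first True per row
first_filled = toy.notna().idxmax(axis=1)

# Reversing the columns makes "first True" mean "last filled month"
last_filled = toy.iloc[:, ::-1].notna().idxmax(axis=1)

# The labels sit on the 1st of the month, so MonthEnd() moves them to its last day
month_end = last_filled + pd.offsets.MonthEnd()
print(month_end.dt.strftime("%Y%m%d").tolist())
```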

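The two new columns can also hold valid datetimes instead of strings, using Series.where with Timestamp.max, as in the second snippet after the printed output: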

print (df)
       id  region  201801  201802  201803  201804  201805  201806  201807  \
0  100001     628     NaN     NaN     NaN     NaN     NaN     NaN     NaN   
1  100002    1149    27.0    24.0    27.0    25.0    24.0    26.0    27.0   
2  100003    1290    26.0    26.0    26.0    26.0    23.0    27.0    27.0   
3  100004     955    25.0    26.0    26.0    24.0    24.0    26.0    28.0   
4  100005    1397    15.0    25.0    26.0    24.0    21.0    27.0    27.0   
5  100006    1397    15.0    25.0    26.0    24.0    21.0    27.0    27.0   

   201808  ...  201811  201812  201901  201902  201903  201904  201905  \
0     NaN  ...      24    20.0    26.0    24.0    26.0    26.0    26.0   
1    28.0  ...      24    21.0    26.0    25.0    27.0    24.0    26.0   
2     NaN  ...      28     NaN    28.0    26.0    27.0    27.0    27.0   
3    27.0  ...      24    12.0     NaN     NaN     NaN     NaN     NaN   
4    26.0  ...      25     NaN     NaN     NaN     NaN     NaN     NaN   
5    26.0  ...      25    23.0    25.0    17.0     NaN     NaN     NaN   

   201906  date_begin  date_end  
0    23.0    20190101  99991231  
1    24.0    20180101  99991231  
2    25.0    20180101  99991231  
3     NaN    20180101  20181231  
4     NaN    20180101  20181130  
5     NaN    20180101  20190228  

[6 rows x 22 columns]

df = pd.read_excel('input.xlsx')

df = df.set_index(['id','region'])
df.columns = pd.to_datetime(df.columns, format='%Y%m')

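# Same logic as before, but the results stay datetimes instead of strings
begin = df.loc[:, df.columns.month == 1].notna().idxmax(axis=1)
end1 = df.iloc[:, ::-1].notna().idxmax(axis=1) + pd.offsets.MonthEnd()

# Open-ended rows get the largest representable day instead of 99991231
end = end1.where(df.iloc[:, -1].isna(), pd.Timestamp.max.floor('D'))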

df.columns = df.columns.strftime('%Y%m')
df = df.assign(date_begin = begin, date_end = end).reset_index()
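For reference, a minimal sketch of what the sentinel looks like in this variant. The series `end1` and the mask `last_col_missing` are made-up stand-ins for the values computed above; the point is that the "lifetime" date becomes 2262-04-11, the last full day a nanosecond-resolution Timestamp can represent, rather than 9999-12-31.

```python
import pandas as pd

# Largest nanosecond-resolution Timestamp, truncated to midnight
sentinel = pd.Timestamp.max.floor("D")
print(sentinel)

# Hypothetical month-end dates and a mask saying whether the last column is empty
end1 = pd.Series(pd.to_datetime(["2018-12-31", "2019-06-30"]))
last_col_missing = pd.Series([True, False])

# where() keeps values where the mask is True and substitutes the sentinel elsewhere
end = end1.where(last_col_missing, sentinel)
print(end)
```

Keeping the column as datetimes allows date arithmetic and comparisons later, at the cost of a sentinel that differs from the "99991231" requested in the question.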