我有一个here
可以找到的输入数据示例我需要根据每行中的数据添加2列:“开始日期”和“结束日期”:
输出示例:
我将不胜感激:)谢谢
答案 0 :(得分:3)
使用pd.melt()
按ID和日期对数据进行排序
import pandas as pd
import numpy as np
from pandas.tseries.offsets import MonthEnd
df = pd.read_excel("input.xlsx")
max_date = df.columns[-1]
res = pd.melt(df, id_vars=['id', 'region'], value_vars=df.columns[2:])
res.dropna(subset=['value'], inplace=True)
res.sort_values(by=['id', 'variable'], ascending=[True, True], inplace=True)
minimum_date = res.drop_duplicates(subset=['id'], keep='first')
maximum_date = res.drop_duplicates(subset=['id'], keep='last')
minimum_date.rename(columns={'variable': 'start_date'}, inplace=True)
maximum_date.rename(columns={'variable': 'end_date'}, inplace=True)
df = pd.merge(df, minimum_date[['id', 'start_date']], on=['id'], how='left')
df = pd.merge(df, maximum_date[['id', 'end_date']], on=['id'], how='left')
df['end_date'] = np.where(df['end_date']==max_date,
"99991231",df['end_date'])
df['start_date'] = (pd.to_datetime(df['start_date'],format="%Y%m",errors='coerce') +MonthEnd(1)).astype(str)
df['end_date'] = (pd.to_datetime(df['end_date'],format="%Y%m",errors='coerce') +MonthEnd(1)).astype(str)
df['end_date'] = np.where(df['end_date']=='NaT',
"99991231",df['end_date'])
print(df)
id region 201801 201802 ... 201905 201906 start_date end_date
0 100001 628 NaN NaN ... 26.0 23.0 2018-09-30 99991231
1 100002 1149 27.0 24.0 ... 26.0 24.0 2018-01-31 99991231
2 100003 1290 26.0 26.0 ... 27.0 25.0 2018-01-31 99991231
3 100004 955 25.0 26.0 ... NaN NaN 2018-01-31 2018-12-31
4 100005 1397 15.0 25.0 ... NaN NaN 2018-01-31 2018-11-30
5 100006 1397 15.0 25.0 ... NaN NaN 2018-01-31 2019-02-28
答案 1 :(得分:2)
想法是通过DataFrame.set_index
将非类似日期时间的列转换为MultiIndex
,然后将列转换为日期时间:
df = pd.read_excel('input.xlsx')
df = df.set_index(['id','region'])
df.columns = pd.to_datetime(df.columns, format='%Y%m')
然后通过DataFrame.assign
创建新列,过滤January
列,比较非缺失值,并通过DataFrame.idxmax
获取第一个值,然后通过Series.dt.strftime
转换为{ {1}},对于begin
索引为end
的第一个掉期订单并获得最后一个非缺失值,如果最后一列的值不小于{{3, }}:
::-1
begin = df.loc[:, df.columns.month == 1].notna().idxmax(axis=1).dt.strftime('%Y%m%d')
end1 = df.iloc[:, ::-1].notna().idxmax(axis=1) + pd.offsets.MonthEnd()
end = end1.dt.strftime('%Y%m%d').where(df.iloc[:, -1].isna(), '99991231')
df.columns = df.columns.strftime('%Y%m')
df = df.assign(date_begin = begin, date_end = end).reset_index()
还可以在Series.where
和Timestamp.max
的两个新列中创建有效的数据时间:
print (df)
id region 201801 201802 201803 201804 201805 201806 201807 \
0 100001 628 NaN NaN NaN NaN NaN NaN NaN
1 100002 1149 27.0 24.0 27.0 25.0 24.0 26.0 27.0
2 100003 1290 26.0 26.0 26.0 26.0 23.0 27.0 27.0
3 100004 955 25.0 26.0 26.0 24.0 24.0 26.0 28.0
4 100005 1397 15.0 25.0 26.0 24.0 21.0 27.0 27.0
5 100006 1397 15.0 25.0 26.0 24.0 21.0 27.0 27.0
201808 ... 201811 201812 201901 201902 201903 201904 201905 \
0 NaN ... 24 20.0 26.0 24.0 26.0 26.0 26.0
1 28.0 ... 24 21.0 26.0 25.0 27.0 24.0 26.0
2 NaN ... 28 NaN 28.0 26.0 27.0 27.0 27.0
3 27.0 ... 24 12.0 NaN NaN NaN NaN NaN
4 26.0 ... 25 NaN NaN NaN NaN NaN NaN
5 26.0 ... 25 23.0 25.0 17.0 NaN NaN NaN
201906 date_begin date_end
0 23.0 20190101 99991231
1 24.0 20180101 99991231
2 25.0 20180101 99991231
3 NaN 20180101 20181231
4 NaN 20180101 20181130
5 NaN 20180101 20190228
[6 rows x 22 columns]
df = pd.read_excel('input.xlsx')
df = df.set_index(['id','region'])
df.columns = pd.to_datetime(df.columns, format='%Y%m')
begin = df.loc[:, df.columns.month == 1].notna().idxmax(axis=1)
end1 = df.iloc[:, ::-1].notna().idxmax(axis=1) + pd.offsets.MonthEnd()
end = end1.where(df.iloc[:, -1].isna(), pd.Timestamp.max.floor('d'))
df.columns = df.columns.strftime('%Y%m')
df = df.assign(date_begin = begin, date_end = end).reset_index()