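Question: Say the input table is a merged table of calls and bills with the following columns: the time of the call, plus one column per billing month. The idea is to build a table holding the last 3 bills the person paid as of the time of the call (the month of the call included), so that the bills can be put into the context of the call.
Example input and output: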
# INPUT:
# df
# TIME ID 2019-08-01 2019-09-01 2019-10-01 2019-11-01 2019-12-01
# 2019-12-01 1 1 2 3 4 5
# 2019-11-01 2 6 7 8 9 10
# 2019-10-01 3 11 12 13 14 15
# EXPECTED OUTPUT:
# df_context
# TIME ID 0 1 2
# 2019-12-01 1 3 4 5
# 2019-11-01 2 7 8 9
# 2019-10-01 3 11 12 13
Creating the example input:
import pandas as pd
df = pd.DataFrame({
'TIME': ['2019-12-01','2019-11-01','2019-10-01'],
'ID': [1,2,3],
'2019-08-01': [1,6,11],
'2019-09-01': [2,7,12],
'2019-10-01': [3,8,13],
'2019-11-01': [4,9,14],
'2019-12-01': [5,10,15],
})
The code I have so far:
# HOW DOES ONE GET THE col_to FOR EVERY ROW?
col_to = df.columns.get_loc(df['TIME'].astype(str).values[0])
col_from = col_to - 3
df_context = pd.DataFrame()
df_context = df_context.append(pd.DataFrame(df.iloc[:, col_from : col_to].values))
df_context["TIME"] = df["TIME"]
cols = df_context.columns.tolist()
df_context = df_context[cols[-1:] + cols[:-1]]
df_context.head()
Output of my code:
# OUTPUTS:
# TIME 0 1 2
# 0 2019-12-01 2 3 4 should be 3 4 5
# 1 2019-11-01 7 8 9 all good
# 2 2019-10-01 12 13 14 should be 11 12 13
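It seems my code is missing a for loop or two around the first two lines to do what I want, but I can't believe there is no better solution than the one I am putting together right now.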
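Answer 0 (score: 0)
I suggest the following steps, which let you avoid dynamic column selection altogether:
1. Melt the wide table into a long table, so that the date columns become a single reference-date column.
2. Compute the month difference between TIME and the reference date.
3. Keep only the rows with difference >= 0 and difference < 3.
4. Number the remaining bills within each call and pivot back to a wide table.
import pandas as pd
# Initialize dataframe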
df = pd.DataFrame({
'TIME': ['2019-12-01','2019-11-01','2019-10-01'],
'ID': [1,2,3],
'2019-08-01': [1,6,11],
'2019-09-01': [2,7,12],
'2019-10-01': [3,8,13],
'2019-11-01': [4,9,14],
'2019-12-01': [5,10,15],
})
# Convert the wide table to a long table by melting the date columns
# Name the new date column as REF_TIME, and the bill column as BILL
date_cols = ['2019-08-01', '2019-09-01', '2019-10-01', '2019-11-01', '2019-12-01']
df = df.melt(id_vars=['TIME','ID'], value_vars=date_cols, var_name='REF_TIME', value_name='BILL')
# Convert TIME and REF_TIME to datetime type
df['TIME'] = pd.to_datetime(df['TIME'])
df['REF_TIME'] = pd.to_datetime(df['REF_TIME'])
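# Find the difference between TIME and REF_TIME in approximate months
# (days divided by 30, then rounded; good enough for a window of a few months)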
df['TIME_DIFF'] = (df['TIME'] - df['REF_TIME']).dt.days
df['TIME_DIFF'] = (df['TIME_DIFF'] / 30).round()
# Keep only the preceding 3 months (including the month = TIME)
selection = (
(df['TIME_DIFF'] < 3) &
(df['TIME_DIFF'] >= 0)
)
# Apply selection, sort the columns and keep only columns needed
df_out = (
df[selection]
.sort_values(['TIME','ID','REF_TIME'])
[['TIME','ID','BILL']]
)
# Add a running number per call, let's call this BILL_NO
df_out = df_out.assign(BILL_NO = df_out.groupby(['TIME','ID']).cumcount() + 1)
# Pivot the output table to the format needed
df_out = df_out.pivot(index=['ID','TIME'], columns='BILL_NO', values='BILL')
Output:
BILL_NO 1 2 3
ID TIME
1 2019-12-01 3 4 5
2 2019-11-01 7 8 9
3 2019-10-01 11 12 13
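A side note on the month difference: dividing the day count by 30 is an approximation that can drift once the gap spans several years. A minimal sketch of an exact calendar-month difference, assuming both columns are already datetime typed; the small `long_df` sample here is made up purely for illustration:

```python
import pandas as pd

# Hypothetical long-format sample: one row per (call, bill month) pair
long_df = pd.DataFrame({
    'TIME': pd.to_datetime(['2019-12-01', '2019-12-01', '2019-12-01', '2019-12-01']),
    'REF_TIME': pd.to_datetime(['2019-08-01', '2019-10-01', '2019-12-01', '2020-01-01']),
})

# Whole calendar months from REF_TIME to TIME (negative when REF_TIME is later)
long_df['TIME_DIFF'] = (
    (long_df['TIME'].dt.year - long_df['REF_TIME'].dt.year) * 12
    + (long_df['TIME'].dt.month - long_df['REF_TIME'].dt.month)
)

print(long_df['TIME_DIFF'].tolist())  # [4, 2, 0, -1]
```

This could stand in for the two TIME_DIFF lines of the answer; the `>= 0` and `< 3` selection would stay unchanged.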
Answer 1 (score: 0)
Here is my (beginner's) solution; it only works if the dates in the column names are in ascending order:
import pandas as pd
# Initializing Dataframe
df = pd.DataFrame({
'TIME': ['2019-12-01','2019-11-01','2019-10-01'],
'ID': [1,2,3],
'2019-08-01': [1,6,11],
'2019-09-01': [2,7,12],
'2019-10-01': [3,8,13],
'2019-11-01': [4,9,14],
'2019-12-01': [5,10,15],})
cols = list(df.columns)
new_df = pd.DataFrame([], columns=["0","1","2"])
# Iterating over rows, selecting desired slices and appending them to a new DataFrame:
for i in range(len(df)):
    searched_date = df.iloc[i, 0]
    searched_column_index = cols.index(searched_date)
    searched_row = df.iloc[[i], searched_column_index-2:searched_column_index+1]
    mapping_column_names = {searched_row.columns[0]: "0", searched_row.columns[1]: "1", searched_row.columns[2]: "2"}
    searched_df = searched_row.rename(mapping_column_names, axis=1)
    new_df = pd.concat([new_df, searched_df], ignore_index=True)
new_df = pd.merge(df.iloc[:,0:2], new_df, left_index=True, right_index=True)
new_df
Output:
TIME ID 0 1 2
0 2019-12-01 1 3 4 5
1 2019-11-01 2 7 8 9
2 2019-10-01 3 11 12 13
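If the date columns cannot be trusted to arrive in ascending order, they can be sorted before slicing. A minimal sketch, assuming every column other than TIME and ID is a date string that `pd.to_datetime` can parse; `df_unsorted` is a made-up sample with shuffled columns:

```python
import pandas as pd

# Hypothetical sample with the date columns deliberately out of order
df_unsorted = pd.DataFrame({
    'TIME': ['2019-12-01', '2019-11-01'],
    'ID': [1, 2],
    '2019-12-01': [5, 10],
    '2019-10-01': [3, 8],
    '2019-11-01': [4, 9],
})

id_cols = ['TIME', 'ID']
date_cols = [c for c in df_unsorted.columns if c not in id_cols]

# Sort by the parsed date rather than by the raw string
date_cols_sorted = sorted(date_cols, key=pd.to_datetime)
df_sorted = df_unsorted[id_cols + date_cols_sorted]

print(df_sorted.columns.tolist())
# ['TIME', 'ID', '2019-10-01', '2019-11-01', '2019-12-01']
```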
Anyway, I think @Toukenize's solution is better, because it does not require iteration.
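For completeness, the loop can also be avoided without reshaping the table. This is a minimal sketch rather than either answerer's method. It assumes the date columns are in ascending order, every TIME value matches a column name exactly, and each call has at least two earlier bill columns. The names `end`, `window` and `last3` are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'TIME': ['2019-12-01', '2019-11-01', '2019-10-01'],
    'ID': [1, 2, 3],
    '2019-08-01': [1, 6, 11],
    '2019-09-01': [2, 7, 12],
    '2019-10-01': [3, 8, 13],
    '2019-11-01': [4, 9, 14],
    '2019-12-01': [5, 10, 15],
})

id_cols = ['TIME', 'ID']
date_cols = [c for c in df.columns if c not in id_cols]
bills = df[date_cols].to_numpy()

# Position of each row's TIME among the date columns (-1 when not found)
end = pd.Index(date_cols).get_indexer(df['TIME'])
if (end < 2).any():
    raise ValueError("some calls do not have three bill columns available")

# Per row, the positions of the last three bills: end-2, end-1, end
window = end[:, None] + np.arange(-2, 1)
last3 = np.take_along_axis(bills, window, axis=1)

df_context = pd.concat(
    [df[id_cols], pd.DataFrame(last3, index=df.index)],
    axis=1,
)
print(df_context)
```

On the sample input this gives the rows 3 4 5, 7 8 9 and 11 12 13, which matches the expected output in the question.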