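"Dynamic" column selection

Time: 2021-01-15 16:09:37

Tags: python pandas data-science

Question: Say the input table is a merged table of calls and bills with the following columns: the call time, plus one column per billing month. The idea is to build a table that contains the last 3 bills the person paid, counting back from the call time. That way the bills can be put in the context of the call.

Example input and output: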

# INPUT:
# df
# TIME        ID   2019-08-01   2019-09-01   2019-10-01   2019-11-01   2019-12-01
# 2019-12-01  1    1            2            3            4            5
# 2019-11-01  2    6            7            8            9            10
# 2019-10-01  3    11           12           13           14           15

# EXPECTED OUTPUT:
# df_context
# TIME        ID   0     1     2
# 2019-12-01  1    3     4     5
# 2019-11-01  2    7     8     9
# 2019-10-01  3    11    12    13

Creating the example input:

import pandas as pd

df = pd.DataFrame({
    'TIME': ['2019-12-01','2019-11-01','2019-10-01'],
    'ID':   [1,2,3],
    '2019-08-01':   [1,6,11],
    '2019-09-01':   [2,7,12],
    '2019-10-01':   [3,8,13],
    '2019-11-01':   [4,9,14],
    '2019-12-01':   [5,10,15],
})

The code I have so far:

# HOW DOES ONE GET THE col_to FOR EVERY ROW?
col_to = df.columns.get_loc(df['TIME'].astype(str).values[0])
col_from = col_to - 3

# DataFrame.append was removed in pandas 2.0, so build the frame directly
df_context = pd.DataFrame(df.iloc[:, col_from : col_to].values)
df_context["TIME"] = df["TIME"]
cols = df_context.columns.tolist()
df_context = df_context[cols[-1:] + cols[:-1]]
df_context.head()

Output of my code:

# OUTPUTS:
#   TIME        0   1   2
# 0 2019-12-01  2   3   4    should be  3     4     5
# 1 2019-11-01  7   8   9    all good
# 2 2019-10-01  12  13  14   should be  11    12    13

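My code seems to be missing a for loop or two around the first two lines to do what I want, but I can't believe there isn't a better solution than the one I'm building right now.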
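One possible loop-free answer to the question in the code comment (how to get col_to for every row) is to look up all positions at once with Index.get_indexer and then pick one window per row with NumPy integer-array indexing. This is only a sketch under assumptions: the date columns are in ascending order with no missing months, and every TIME value exists as a column name. The names n_bills, bills, col_idx and row_idx are illustrative and do not come from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'TIME': ['2019-12-01', '2019-11-01', '2019-10-01'],
    'ID':   [1, 2, 3],
    '2019-08-01':   [1, 6, 11],
    '2019-09-01':   [2, 7, 12],
    '2019-10-01':   [3, 8, 13],
    '2019-11-01':   [4, 9, 14],
    '2019-12-01':   [5, 10, 15],
})

n_bills = 3  # illustrative parameter: how many bills to keep per call

# Work on the bill columns only, so positions are relative to the first date column
bills = df.drop(columns=['TIME', 'ID'])

# Column position of each row's TIME among the date columns (-1 if not found)
col_to = bills.columns.get_indexer(df['TIME'])
if (col_to < n_bills - 1).any():
    # Also catches -1; without this check negative positions would wrap around
    raise ValueError("some rows do not have n_bills columns up to TIME")

# For every row, the positions col_to-2, col_to-1, col_to
col_idx = col_to[:, None] + np.arange(-(n_bills - 1), 1)
row_idx = np.arange(len(df))[:, None]

# Integer-array indexing selects a different column window for each row
values = bills.to_numpy()[row_idx, col_idx]

df_context = pd.DataFrame(values, index=df.index)
df_context.insert(0, 'ID', df['ID'])
df_context.insert(0, 'TIME', df['TIME'])
print(df_context)
```

The window here ends at the TIME column itself (inclusive), which is what the expected output asks for; the original slice col_from : col_to stops one column short of it.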

2 answers:

Answer 0: (score: 0)

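I suggest the following steps, which let you avoid dynamic column selection altogether.

  1. Convert the wide table (reference dates as columns) to a long table (reference dates as rows)
  2. Compute the difference in months between the call time TIME and the reference date
  3. Keep only the rows with 0 <= difference < 3
  4. Format the output table as required (add a running number, then pivot)

import pandas as pd

# Initialize dataframe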
df = pd.DataFrame({
    'TIME': ['2019-12-01','2019-11-01','2019-10-01'],
    'ID':   [1,2,3],
    '2019-08-01':   [1,6,11],
    '2019-09-01':   [2,7,12],
    '2019-10-01':   [3,8,13],
    '2019-11-01':   [4,9,14],
    '2019-12-01':   [5,10,15],
})

# Convert the wide table to a long table by melting the date columns
# Name the new date column as REF_TIME, and the bill column as BILL

date_cols = ['2019-08-01', '2019-09-01', '2019-10-01', '2019-11-01', '2019-12-01']
df = df.melt(id_vars=['TIME','ID'], value_vars=date_cols, var_name='REF_TIME', value_name='BILL')

# Convert TIME and REF_TIME to datetime type
df['TIME'] = pd.to_datetime(df['TIME'])
df['REF_TIME'] = pd.to_datetime(df['REF_TIME'])

# Find out difference between TIME and REF_TIME
df['TIME_DIFF'] = (df['TIME'] - df['REF_TIME']).dt.days
df['TIME_DIFF'] = (df['TIME_DIFF'] / 30).round()

# Keep only the preceding 3 months (including the month = TIME)
selection = (
    (df['TIME_DIFF'] < 3) &
    (df['TIME_DIFF'] >= 0)
)

# Apply selection, sort the columns and keep only columns needed
df_out = (
    df[selection]
    .sort_values(['TIME','ID','REF_TIME'])
    [['TIME','ID','BILL']]
)

# Add a running number, let's call this BILL_NO
df_out = df_out.assign(BILL_NO = df_out.groupby(['TIME','ID']).cumcount() + 1)

# Pivot the output table to the format needed
df_out = df_out.pivot(index=['ID','TIME'], columns='BILL_NO', values='BILL')

Output:

BILL_NO         1   2   3
ID  TIME            
1   2019-12-01  3   4   5
2   2019-11-01  7   8   9
3   2019-10-01  11  12  13
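Dividing the day difference by 30 and rounding is an approximation. It works for the first-of-month dates in this example, but it may drift for other dates or longer spans. A minimal sketch of an exact calendar-month difference follows; month_diff is a hypothetical helper name and the sample dates are made up for illustration.

```python
import pandas as pd

def month_diff(later: pd.Series, earlier: pd.Series) -> pd.Series:
    """Number of calendar months from `earlier` to `later`, ignoring the day of month."""
    return (later.dt.year - earlier.dt.year) * 12 + (later.dt.month - earlier.dt.month)

# Made-up sample dates, including one pair that crosses a year boundary
time = pd.to_datetime(pd.Series(['2019-12-01', '2019-11-01', '2020-01-15']))
ref_time = pd.to_datetime(pd.Series(['2019-10-01', '2019-08-01', '2019-11-30']))

diff = month_diff(time, ref_time)
print(diff.tolist())
```

In the answer above, this would take the place of the two TIME_DIFF lines, for example df['TIME_DIFF'] = month_diff(df['TIME'], df['REF_TIME']), with the same 0 <= difference < 3 selection applied afterwards.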

Answer 1: (score: 0)

Here is my (novice) solution. It only works if the dates in the column names are in ascending order:

import pandas as pd

# Initializing Dataframe
df = pd.DataFrame({
    'TIME': ['2019-12-01','2019-11-01','2019-10-01'],
    'ID':   [1,2,3],
    '2019-08-01':   [1,6,11],
    '2019-09-01':   [2,7,12],
    '2019-10-01':   [3,8,13],
    '2019-11-01':   [4,9,14],
    '2019-12-01':   [5,10,15],
})

cols = list(df.columns)
new_df = pd.DataFrame([], columns=["0","1","2"])

# Iterate over rows, select the desired slice and append it to a new DataFrame
for i in range(len(df)):
    # The call time of this row doubles as the name of the last column to keep
    searched_date = df.iloc[i, 0]
    searched_column_index = cols.index(searched_date)
    searched_row = df.iloc[[i], searched_column_index-2:searched_column_index+1]
    mapping_column_names = {searched_row.columns[0]: "0", searched_row.columns[1]: "1", searched_row.columns[2]: "2"}
    searched_df = searched_row.rename(mapping_column_names, axis=1)
    new_df = pd.concat([new_df, searched_df], ignore_index=True)

# Put TIME and ID back in front of the selected bills
new_df = pd.merge(df.iloc[:,0:2], new_df, left_index=True, right_index=True)
new_df

Output:

     TIME      ID   0   1   2
0  2019-12-01   1   3   4   5
1  2019-11-01   2   7   8   9
2  2019-10-01   3  11  12  13

In any case, I think @Toukenize's solution is better, because it doesn't require iteration.
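If the ascending-order requirement is a concern, one option is to sort the date columns chronologically before doing any positional lookup. The sketch below assumes every column other than TIME and ID is a date string that pd.to_datetime can parse; id_cols, date_cols and df_sorted are illustrative names.

```python
import pandas as pd

# Same data as above, but with the date columns deliberately shuffled
df = pd.DataFrame({
    'TIME': ['2019-12-01', '2019-11-01', '2019-10-01'],
    'ID':   [1, 2, 3],
    '2019-10-01':   [3, 8, 13],
    '2019-08-01':   [1, 6, 11],
    '2019-12-01':   [5, 10, 15],
    '2019-09-01':   [2, 7, 12],
    '2019-11-01':   [4, 9, 14],
})

id_cols = ['TIME', 'ID']
date_cols = [c for c in df.columns if c not in id_cols]

# Sort by the parsed date rather than by the raw string
sorted_date_cols = sorted(date_cols, key=pd.to_datetime)

# Reorder the frame: identifier columns first, then dates in ascending order
df_sorted = df[id_cols + sorted_date_cols]
print(df_sorted.columns.tolist())
```

After this reordering, the row-by-row slicing above can be applied to df_sorted instead of df.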