Group by and transform in pandas

Time: 2020-04-17 03:06:42

Tags: python pandas numpy pandas-groupby

Sample DataFrame:

import numpy as np
import pandas as pd

# B and C are random integers, so their values differ on every run
sample_df = pd.DataFrame(np.random.randint(1,20,size=(10, 2)), columns=list('BC'))
sample_df["date"]= ["2020-02-01","2020-02-01","2020-02-01","2020-02-01","2020-02-01",
                    "2020-02-02","2020-02-02","2020-02-02","2020-02-02","2020-02-02"]
sample_df["date"] = pd.to_datetime(sample_df["date"])
# Use the date as the index; the index therefore contains duplicate labels
sample_df.set_index(sample_df["date"],inplace=True)
sample_df["A"]=[10,10,10,10,10,12,1,3,4,2]
del sample_df["date"]
sample_df

Sample DataFrame output:

            B   C   A
date                  
2020-02-01  19  12  10
2020-02-01  11   1  10
2020-02-01  10   1  10
2020-02-01  13   4  10
2020-02-01   5  15  10
2020-02-02  10   3  12
2020-02-02   3   7   1
2020-02-02   6  13   3
2020-02-02  17  10   4
2020-02-02  15   1   2

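Condition:

Group by the index, then apply a pandas quantile cut (pd.qcut) to column A within each group. If that raises an error, apply the quantile cut to mean(col A and col C) instead.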

try:
    Quantile cut column A
except:
    quantile cut mean(col A and col C)

Code:

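def func(df,n_bins):
    try:
        # Quantile-cut column A into n_bins bins labelled 0..n_bins-1
        proc_col = pd.qcut(df["A"].values, n_bins, labels=range(0,n_bins))
        return proc_col
    except ValueError:
        # qcut raises ValueError when the bin edges are not unique (e.g. A is constant),
        # so fall back to the row-wise mean of the selected columns
        proc_col = pd.qcut(df.mean(axis =1).values, n_bins, labels=range(0,n_bins))
        return proc_col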

sample_df["A"]=sample_df.groupby([sample_df.index.get_level_values(0)])[["C","A"]].apply(lambda df: func(df,3))
sample_df

Output:

            B   C     A
date            
2020-02-01  1   16  [1, 2, 1, 0, 0] Categories (3, int64): [0 < 1 ...
2020-02-01  5   19  [1, 2, 1, 0, 0] Categories (3, int64): [0 < 1 ...
2020-02-01  2   16  [1, 2, 1, 0, 0] Categories (3, int64): [0 < 1 ...
2020-02-01  12  11  [1, 2, 1, 0, 0] Categories (3, int64): [0 < 1 ...
2020-02-01  15  10  [1, 2, 1, 0, 0] Categories (3, int64): [0 < 1 ...
2020-02-02  19  17  [2, 0, 1, 2, 0] Categories (3, int64): [0 < 1 ...
2020-02-02  17  7   [2, 0, 1, 2, 0] Categories (3, int64): [0 < 1 ...
2020-02-02  14  1   [2, 0, 1, 2, 0] Categories (3, int64): [0 < 1 ...
2020-02-02  19  13  [2, 0, 1, 2, 0] Categories (3, int64): [0 < 1 ...
2020-02-02  15  13  [2, 0, 1, 2, 0] Categories (3, int64): [0 < 1 ...

Expected output:

            B   C    A
date            
2020-02-01  1   16   1
2020-02-01  5   19   2
2020-02-01  2   16   1
2020-02-01  12  11   0
2020-02-01  15  10   0
2020-02-02  19  17   2
2020-02-02  17  7    0
2020-02-02  14  1    1
2020-02-02  19  13   2
2020-02-02  15  13   0

Any suggestions about this error would be great. I tried using transform instead of apply, but that gave me an error.

1 answer:

Answer 0 (score: 1)

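The apply call returns one categorical per group, so each date holds a whole array of labels. Use transform with pd.Series to expand each array into columns, then stack so that the values from both groups are appended into one long Series with the corresponding index, and drop the second level to get rid of the two-level index.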
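To see what each reshaping step does, here is a minimal sketch on a toy Series. The values are made up, `per_group`, `wide`, `long` and `flat` are illustrative names, and the wide frame is built with `pd.DataFrame(...)` rather than `transform(pd.Series)` just to keep the example independent of the grouped result.

```python
import pandas as pd

# Toy stand-in for the groupby/apply result: one list of bin labels per date.
# The values are made up for illustration.
per_group = pd.Series(
    [[1, 2, 1], [2, 0, 1]],
    index=pd.Index(["2020-02-01", "2020-02-02"], name="date"),
)

# Step 1: expand each list into its own row, one column per position.
wide = pd.DataFrame(per_group.tolist(), index=per_group.index)

# Step 2: stack the columns into a long Series indexed by (date, position).
long = wide.stack()

# Step 3: drop the position level so only the date index remains.
flat = long.droplevel(1)
print(flat)
```

Applied to the grouped result from the question, the same chain becomes the one-liner below.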

sample_df["A"]=sample_df.groupby([sample_df.index.get_level_values(0)])[["C","A"]].apply(lambda df: func(df,3)).transform(pd.Series).stack().droplevel(1)
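An alternative that avoids the reshaping, offered only as a sketch: compute the bin codes group by group and write them back by row position. The helper names `qcut_codes` and `bin_by_group` are hypothetical, `labels=False` is used so that `pd.qcut` returns plain integer codes, and the B and C values are copied from the output table in the question.

```python
import numpy as np
import pandas as pd

def qcut_codes(group, n_bins):
    """Integer bin codes for one group: cut A, else cut the mean of C and A."""
    try:
        return pd.qcut(group["A"].to_numpy(), n_bins, labels=False)
    except ValueError:
        # Raised when the bin edges of A are not unique (e.g. A is constant).
        row_mean = group[["C", "A"]].mean(axis=1).to_numpy()
        return pd.qcut(row_mean, n_bins, labels=False)

def bin_by_group(df, n_bins):
    """Apply qcut_codes per index value and place results by row position."""
    out = np.empty(len(df), dtype=int)
    # .indices maps each group key to the integer positions of its rows,
    # which sidesteps label alignment on the duplicated date index.
    for _, pos in df.groupby(level=0).indices.items():
        out[pos] = qcut_codes(df.iloc[pos], n_bins)
    return pd.Series(out, index=df.index)

dates = pd.to_datetime(["2020-02-01"] * 5 + ["2020-02-02"] * 5)
sample_df = pd.DataFrame(
    {
        "B": [1, 5, 2, 12, 15, 19, 17, 14, 19, 15],
        "C": [16, 19, 16, 11, 10, 17, 7, 1, 13, 13],
        "A": [10, 10, 10, 10, 10, 12, 1, 3, 4, 2],
    },
    index=pd.DatetimeIndex(dates, name="date"),
)

binned = bin_by_group(sample_df, 3)
# Assign the raw values so pandas does not try to align duplicate labels.
sample_df["A"] = binned.to_numpy()
print(sample_df)
```

With this data the first group falls back to the mean (A is constant there) and the second group is cut on A directly, so the result should reproduce the expected column shown in the question.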