Question

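Python Pandas: how to discretize (bin) continuous variables (A and B) based on a binary variable "Outcome" (0, 1). The maximum number of bins should be 5, and each bin should contain at least 5% of the outcome category "1".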

I am able to bin the variables with the pandas functions qcut or cut, but I do not know how to enforce the 5% condition while binning each numeric variable.

For example:

bin                 total count   0's count   1's count
(0.92, 1.481]       100           80          20
(0.472, 0.92]       100           70          30
(-0.248, 0.472]     100           60          40
(-0.53, -0.248]     100           90          10
(-1.208, -0.53]     100           100         0

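The last bin has no 1's at all, so it does not satisfy the 5% rule. In that case the variable has to be re-binned so that the result is 5 bins, or fewer than 5 bins, all of which satisfy the 5% condition.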
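To make the rule concrete, here is a minimal sketch of one possible approach: start from the qcut edges and merge any failing bin into a neighbour until every bin passes. It assumes that "5%" means each bin must hold at least 5% of all rows where Outcome == 1. The function name `bin_with_min_events` and its parameters are hypothetical, not part of pandas.

```python
import numpy as np
import pandas as pd


def bin_with_min_events(values, outcome, max_bins=5, min_share=0.05):
    """Hypothetical helper: bin `values` into at most `max_bins` bins so that
    every bin holds at least `min_share` of all rows where outcome == 1.

    Assumption: the 5% rule refers to the share of all 1's, not to the
    event rate inside a bin.
    """
    is_one = outcome == 1
    threshold = min_share * is_one.sum()

    # Start from equal-frequency edges, then merge bins that fail the rule.
    _, edges = pd.qcut(values, max_bins, retbins=True, duplicates="drop")
    edges = list(edges)

    while True:
        bins = pd.cut(values, edges, include_lowest=True)
        ones = is_one.groupby(bins, observed=False).sum().to_numpy()
        failing = np.flatnonzero(ones < threshold)
        if len(failing) == 0 or len(edges) <= 2:
            return bins
        i = failing[0]
        # Merge bin i with its right neighbour by dropping their shared edge;
        # the last bin has no right neighbour, so it merges to the left.
        drop = i + 1 if i + 1 < len(edges) - 1 else i
        del edges[drop]
```

Something like `df['A_Bin'] = bin_with_min_events(df['A'], df['Outcome'])` could then be called in a loop over all numeric columns. Merging left to right is only one of several reasonable strategies; it does not try to keep the bins equally sized.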

I have provided sample code below in which I bin the data with qcut, but it is missing the 5% logic.

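I would be glad if someone could help me with this. I also have more than 1000 variables to bin, and I need to apply the condition to the binning of each one. Please also tell me the main difference between the "qcut" and "cut" functions in pandas.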
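On the qcut versus cut part, my understanding is that cut splits the value range into equal-width intervals (or at explicitly given edges), while qcut picks its edges from quantiles so that each bin gets roughly the same number of rows. A small sketch with made-up data that contains one outlier:

```python
import pandas as pd

# Nine small values and one outlier (illustrative data only)
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

by_width = pd.cut(s, 2)   # two intervals of equal width over [1, 100]
by_count = pd.qcut(s, 2)  # two intervals split at the median

print(by_width.value_counts(sort=False))
print(by_count.value_counts(sort=False))
```

With cut, nine of the ten values fall into the first interval because the outlier stretches the range; with qcut, each interval receives five values.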

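import random

import numpy as np
import pandas as pd

# Seed NumPy so that columns A and B are reproducible
np.random.seed(1001)
df = pd.DataFrame(np.random.randn(25, 2), columns=['A', 'B'])

# Random binary outcome (the stdlib random module is not seeded, so this column changes on every run)
df['Outcome'] = [random.randint(0, 1) for _ in range(25)]

# Equal-frequency binning into 5 bins per variable (no 5% check yet)
df['A_Bin'] = pd.qcut(df["A"], 5)
df['B_Bin'] = pd.qcut(df["B"], 5)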

df:

     A           B       Outcome       A_Bin           B_Bin
0   -1.086446   -0.896065   0   (-1.208, -0.53]     (-1.465, -0.811]
1   -0.306299   -1.339934   0   (-0.53, -0.248]     (-1.465, -0.811]
2   -1.206586   -0.641727   1   (-1.208, -0.53]     (-0.811, -0.0992]
3   1.307946    1.845460    0   (0.92, 1.481]       (0.984, 1.845]
4   0.829115    -0.023299   0   (0.472, 0.92]       (-0.0992, 0.984]
5   -0.208564   -0.916620   0   (-0.248, 0.472]     (-1.465, -0.811]
6   -1.074743   -0.086143   0   (-1.208, -0.53]     (-0.0992, 0.984]
7   1.175839    -1.635092   0   (0.92, 1.481]       (-2.213, -1.465]
8   1.228194    1.076386    0   (0.92, 1.481]       (0.984, 1.845]
9   0.394773    -0.387701   0   (-0.248, 0.472]     (-0.811, -0.0992]
10  0.588402    -1.433299   0   (0.472, 0.92]       (-1.465, -0.811]
11  -0.323575   1.252985    0   (-0.53, -0.248]     (0.984, 1.845]
12  -0.730743   1.428485    0   (-1.208, -0.53]     (0.984, 1.845]
13  0.944654    -0.264003   1   (0.92, 1.481]       (-0.811, -0.0992]
14  -0.163213   -0.964107   0   (-0.248, 0.472]     (-1.465, -0.811]
15  -0.342647   0.048904    0   (-0.53, -0.248]     (-0.0992, 0.984]
16  -0.499632   -2.211905   1   (-0.53, -0.248]     (-2.213, -1.465]
17  0.309837    -1.589651   1   (-0.248, 0.472]     (-2.213, -1.465]
18  0.729455    -0.107901   1   (0.472, 0.92]       (-0.811, -0.0992]
19  -0.650171   -1.693057   0   (-1.208, -0.53]     (-2.213, -1.465]
20  -0.368474   -1.590372   1   (-0.53, -0.248]     (-2.213, -1.465]
21  1.480506    0.474440    1   (0.92, 1.481]       (-0.0992, 0.984]
22  0.913864    0.960470    1   (0.472, 0.92]       (-0.0992, 0.984]
23  0.084543    1.717231    0   (-0.248, 0.472]     (0.984, 1.845]
24  0.851893    -0.754885   1   (0.472, 0.92]       (-0.811, -0.0992]
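A per-bin summary in the same shape as the example table (total count, 0's count, 1's count) could be produced with pd.crosstab. This is only a sketch: it rebuilds the frame so that it runs on its own, and it draws Outcome from the seeded NumPy generator, so the Outcome values will not match the listing above. The column name "share of all 1's" is my own addition.

```python
import numpy as np
import pandas as pd

np.random.seed(1001)
df = pd.DataFrame(np.random.randn(25, 2), columns=["A", "B"])
# Assumption: seeded NumPy draw instead of the unseeded stdlib random module
df["Outcome"] = np.random.randint(0, 2, size=25)
df["A_Bin"] = pd.qcut(df["A"], 5)

# Rows: bins of A; columns: counts of Outcome == 0 and Outcome == 1
summary = pd.crosstab(df["A_Bin"], df["Outcome"])
summary = summary.reindex(columns=[0, 1], fill_value=0)
summary.columns = ["0's count", "1's count"]
summary.insert(0, "total cnt", summary.sum(axis=1))

# Share of all 1's that falls into each bin, to compare against the 5% rule
summary["share of all 1's"] = summary["1's count"] / summary["1's count"].sum()
print(summary)
```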

Counts:

df['Outcome'].value_counts()

Outcome Count
 0       16
 1        9
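As a worked check of the rule on this small sample, assuming "5%" means 5% of all rows with Outcome == 1: with 9 ones the threshold is 0.45, so a bin needs at least one 1 to pass. The per-bin counts below were tallied by hand from the A_Bin column in the listing above.

```python
import math

total_ones = 9                     # from the value counts above
threshold = 0.05 * total_ones      # 0.45 ones per bin
min_ones = math.ceil(threshold)    # smallest whole count that passes

# Number of 1's per A_Bin, counted by hand from the printed frame
ones_per_bin = {
    "(-1.208, -0.53]": 1,
    "(-0.53, -0.248]": 2,
    "(-0.248, 0.472]": 1,
    "(0.472, 0.92]": 3,
    "(0.92, 1.481]": 2,
}

failing = [label for label, n in ones_per_bin.items() if n < threshold]
print(min_ones, failing)
```

In this particular sample every A bin passes; the problem only shows up with data like the example table, where one bin has no 1's at all.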

How to bin continuous variables based on a binary "Outcome" (0, 1), where each bin should contain at least 5% of outcome "1"

0 answers: