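Python Pandas: how to discretize (bin) continuous variables (A and B) based on a binary variable "Outcome" (0, 1). The maximum number of bins should be 5, and each bin of a variable should contain at least 5% of the Outcome variable's category "1".
I can bin the variables with the pandas functions qcut or cut, but I don't know how to enforce the 5% condition when binning each numeric variable.
For example: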
bin                total cnt   0's count   1's count
(0.92, 1.481]      100         80          20
(0.472, 0.92]      100         70          30
(-0.248, 0.472]    100         60          40
(-0.53, -0.248]    100         90          10
(-1.208, -0.53]    100         100         0
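The last bin has no 1s at all, so it fails the 5% rule. In that case the variable must be re-binned so that every bin satisfies the 5% condition, using 5 bins or fewer.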
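To make the rule concrete, here is a minimal sketch of the check on the example table. It assumes "5%" means that each bin must hold at least 5% of all the 1s for that variable (the rule could also be read as a minimum event rate inside each bin). The names `summary` and `MIN_SHARE` are made up for illustration.

```python
import pandas as pd

# Counts copied from the example table above.
summary = pd.DataFrame(
    {
        "total": [100, 100, 100, 100, 100],
        "zeros": [80, 70, 60, 90, 100],
        "ones": [20, 30, 40, 10, 0],
    },
    index=[
        "(0.92, 1.481]",
        "(0.472, 0.92]",
        "(-0.248, 0.472]",
        "(-0.53, -0.248]",
        "(-1.208, -0.53]",
    ],
)

MIN_SHARE = 0.05  # assumed reading of the 5% rule

# Share of all 1s that falls into each bin, and whether the bin passes.
summary["share_of_ones"] = summary["ones"] / summary["ones"].sum()
summary["meets_rule"] = summary["share_of_ones"] >= MIN_SHARE
print(summary)
```

Under this reading only the last bin is flagged, because it holds 0 of the 100 ones.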
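I have provided sample code below in which I bin the data with qcut, but I am missing the logic for the 5% rule.
I would be glad if someone could help me with this. I also have more than 1,000 variables to bin, and the condition needs to be applied to each variable's binning. Please also tell me the main difference between the "qcut" and "cut" functions in pandas.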
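On the qcut versus cut part of the question, a small illustration may help: `pd.cut` splits the value range into intervals of equal width, while `pd.qcut` splits at quantiles so that each bin holds roughly the same number of rows. The toy series `values` below is invented for the demonstration.

```python
import pandas as pd

# Toy data with one large outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# cut: 2 bins of equal width over the range 1..100.
by_width = pd.cut(values, bins=2)
# qcut: 2 bins split at the median, so the row counts are balanced.
by_quantile = pd.qcut(values, q=2)

width_counts = by_width.value_counts().sort_index().tolist()
quantile_counts = by_quantile.value_counts().sort_index().tolist()
print(width_counts)     # [9, 1]
print(quantile_counts)  # [5, 5]
```

With skewed data, equal-width bins can end up nearly empty, which is why qcut is often the more natural starting point when each bin needs a minimum number of events.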
import random

import numpy as np
import pandas as pd

np.random.seed(1001)  # fixes A and B; Outcome uses the unseeded `random` module
df = pd.DataFrame(np.random.randn(25, 2), columns=['A', 'B'])
df['Outcome'] = [random.randint(0, 1) for _ in range(25)]
# Quantile-based binning into 5 bins of roughly equal size
df['A_Bin'] = pd.qcut(df["A"], 5)
df['B_Bin'] = pd.qcut(df["B"], 5)
DF:
A B Outcome A_Bin B_Bin
0 -1.086446 -0.896065 0 (-1.208, -0.53] (-1.465, -0.811]
1 -0.306299 -1.339934 0 (-0.53, -0.248] (-1.465, -0.811]
2 -1.206586 -0.641727 1 (-1.208, -0.53] (-0.811, -0.0992]
3 1.307946 1.845460 0 (0.92, 1.481] (0.984, 1.845]
4 0.829115 -0.023299 0 (0.472, 0.92] (-0.0992, 0.984]
5 -0.208564 -0.916620 0 (-0.248, 0.472] (-1.465, -0.811]
6 -1.074743 -0.086143 0 (-1.208, -0.53] (-0.0992, 0.984]
7 1.175839 -1.635092 0 (0.92, 1.481] (-2.213, -1.465]
8 1.228194 1.076386 0 (0.92, 1.481] (0.984, 1.845]
9 0.394773 -0.387701 0 (-0.248, 0.472] (-0.811, -0.0992]
10 0.588402 -1.433299 0 (0.472, 0.92] (-1.465, -0.811]
11 -0.323575 1.252985 0 (-0.53, -0.248] (0.984, 1.845]
12 -0.730743 1.428485 0 (-1.208, -0.53] (0.984, 1.845]
13 0.944654 -0.264003 1 (0.92, 1.481] (-0.811, -0.0992]
14 -0.163213 -0.964107 0 (-0.248, 0.472] (-1.465, -0.811]
15 -0.342647 0.048904 0 (-0.53, -0.248] (-0.0992, 0.984]
16 -0.499632 -2.211905 1 (-0.53, -0.248] (-2.213, -1.465]
17 0.309837 -1.589651 1 (-0.248, 0.472] (-2.213, -1.465]
18 0.729455 -0.107901 1 (0.472, 0.92] (-0.811, -0.0992]
19 -0.650171 -1.693057 0 (-1.208, -0.53] (-2.213, -1.465]
20 -0.368474 -1.590372 1 (-0.53, -0.248] (-2.213, -1.465]
21 1.480506 0.474440 1 (0.92, 1.481] (-0.0992, 0.984]
22 0.913864 0.960470 1 (0.472, 0.92] (-0.0992, 0.984]
23 0.084543 1.717231 0 (-0.248, 0.472] (0.984, 1.845]
24 0.851893 -0.754885 1 (0.472, 0.92] (-0.811, -0.0992]
Counts:
df['Outcome'].value_counts()
Outcome Count
0 16
1 9
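One possible way to get the missing 5% logic, sketched under assumptions: start with 5 quantile bins and reduce the number of bins until every bin holds at least 5% of all the 1s. This is a simple fallback strategy, not the only option (merging adjacent failing bins is another), and the function name `bin_with_min_event_share` and its parameters are hypothetical.

```python
import numpy as np
import pandas as pd


def bin_with_min_event_share(x, y, max_bins=5, min_share=0.05):
    """Try qcut with max_bins, max_bins - 1, ... bins and return the first
    binning in which every bin holds at least min_share of all 1s in y."""
    total_ones = (y == 1).sum()
    for q in range(max_bins, 0, -1):
        binned = pd.qcut(x, q, duplicates="drop")
        # y is 0/1, so the sum per bin is the number of 1s in that bin.
        ones_per_bin = y.groupby(binned, observed=False).sum()
        if (ones_per_bin / total_ones >= min_share).all():
            return binned
    return binned  # single bin; only reached if y contains no 1s


# Made-up data: the 1s occur only where x >= 20.
x = pd.Series(np.arange(100, dtype=float), name="A")
y = (x >= 20).astype(int)

binned = bin_with_min_event_share(x, y)
n_bins = binned.cat.categories.size
print(n_bins)  # 4: with 5 bins the lowest bin has no 1s, with 4 it has 5 of 80
```

For many variables, the same function could be applied column by column, for example in a dict comprehension over the numeric column names with `df["Outcome"]` as `y`.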