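I am trying to split Xs (continuous) and Ys (binary) into equal halves (by count) until a "breakpoint" is found. For example, the code below should generate 5,000 observations with roughly equal proportions of 0s and 1s. I then want to take the half with the larger proportion of 1s and split it again, and so on, until there is no way to split any further.
Edit: My data is not normally distributed, but for this example I had to generate fake data.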
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
random.seed(191)
np.random.seed(191)  # also seed NumPy, since Y is drawn with np.random.randint
df = pd.DataFrame( np.random.randint( 0,2,size = ( 5000,1 ) ), columns = list( 'Y' ) )
df['X'] = pd.Series( random.choices( range( 5000 ), k = 5000) )
# Creating equal-sized bins
df['bins'] = pd.qcut( df['X'], 2 )
print( df.groupby('bins')['Y'].value_counts() )
print( df.groupby('bins')['Y'].mean() )
# Next I want to take the bins with the larger proportion of 1s and repeat the qcut until a minimum/maximum(?) is reached
Answer 0 (score: 0)
You can do what you need with the following code:
import numpy as np
import pandas as pd
import random
SIZE = 5000
df = pd.DataFrame(np.random.randint(0, 2, size=(SIZE, 1)), columns=list('Y'))
df['X'] = pd.Series(random.choices(range(5000), k=SIZE))
def splitting(df):
    # base case - no way to split anymore - only 0s or only 1s are in 'Y'
    if df['Y'].unique().shape[0] == 1:
        return df
    # recursion
    else:
        df['bins'] = pd.qcut(df['X'], 2)
        label = df.groupby('bins')['Y'].mean().idxmax()
        df_1 = df[df['bins'] != label].copy()
        df_2 = df[df['bins'] == label].copy()
        return pd.concat([df_1, splitting(df_2)])
result = splitting(df)
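One thing to watch for: pd.qcut raises a ValueError when the quantile edges are not unique, which can happen deep in the recursion once a subset has heavily tied X values. Below is a minimal sketch of an iterative variant that stops in that case, and also once the subset gets too small. The names split_until_stop and min_rows are my own (hypothetical, not from the question), and the stopping rules are assumptions about what "minimum/maximum" should mean here. It keeps only the surviving half and records each step, rather than concatenating the discarded halves back in.

```python
import numpy as np
import pandas as pd


def split_until_stop(df, x="X", y="Y", min_rows=50):
    """Repeatedly halve `df` on `x`, keeping the half with the larger mean of `y`.

    Stops when the subset has a single `y` value, has fewer than `min_rows`
    rows, or cannot be cut into two quantile bins. `min_rows` is an assumed
    stopping rule, not something stated in the question.
    """
    current = df
    history = []
    while current[y].nunique() > 1 and len(current) >= min_rows:
        try:
            # labels=False gives bin codes 0/1; retbins=True also returns the edges
            codes, edges = pd.qcut(current[x], 2, labels=False, retbins=True)
        except ValueError:
            # qcut could not produce unique bin edges (too many tied x values)
            break
        means = current[y].groupby(codes).mean()
        best = int(means.idxmax())  # on a tie, the lower bin is kept
        current = current[codes == best]
        history.append({
            "low": float(edges[best]),
            "high": float(edges[best + 1]),
            "rows": len(current),
            "mean_y": float(means.loc[best]),
        })
    return current, history


# Usage on fake data shaped like the question's
rng = np.random.default_rng(191)
df = pd.DataFrame({
    "Y": rng.integers(0, 2, size=5000),
    "X": rng.integers(0, 5000, size=5000),
})
final, history = split_until_stop(df)
print(pd.DataFrame(history))
```

Unlike the recursive version, this does not add a 'bins' column to the input frame, and the history table makes it easy to see at which step the proportion of 1s stops improving.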