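I'm trying to replicate a specific method for assigning records to deciles, and pandas.qcut() does the job well. My only concern is that there doesn't seem to be a way to assign the uneven remainder to specific bins, as the method I'm replicating does.
Here is my example: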
import numpy as np
import pandas as pd

num = np.random.rand(153, 1)
my_list = [x[0] for x in num]  # flatten the (153, 1) array into a plain list
ser = pd.Series(my_list)
bins = pd.qcut(ser, 10, labels=False)
bins.value_counts()
Which outputs:
9 16
4 16
0 16
8 15
7 15
6 15
5 15
3 15
2 15
1 15
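Seven bins have 15 records and three have 16. What I'd like to do is specify which bins receive the 16 records (the ones marked with < below):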
9 16 <
4 16
0 16
8 15
7 15
6 15
5 15 <
3 15
2 15 <
1 15
Is this possible with pd.qcut?
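For reference, the 15/16 split above follows from integer division of the record count by the number of bins. This is only a quick arithmetic check; the variable names are my own, and listing the larger bins first is an arbitrary choice for illustration, not what qcut does:

```python
n_records, n_bins = 153, 10

# 153 = 10 * 15 + 3, so every bin gets 15 records and 3 bins get one more.
base_size, n_extra = divmod(n_records, n_bins)

# One possible list of bin sizes: the larger bins are listed first here.
sizes = [base_size + 1] * n_extra + [base_size] * (n_bins - n_extra)
```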
Answer 0: (score: 0)
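Since there were no answers, and the few people I asked thought it was unlikely to be possible, I put together a function that does this: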
def defined_qcut(df, value_series, number_of_bins, bins_for_extras, labels=False):
    # Valid bin labels run from 0 to number_of_bins - 1
    if any(x < 0 or x >= number_of_bins for x in bins_for_extras):
        raise ValueError("Attempted to allocate to a bin that doesn't exist")
    base_number, number_of_values_to_allocate = divmod(df[value_series].count(), number_of_bins)
    bins_for_extras = bins_for_extras[:number_of_values_to_allocate]
    if number_of_values_to_allocate == 0:
        # The records divide evenly, so plain qcut is enough
        df['bins'] = pd.qcut(df[value_series], number_of_bins, labels=labels)
        return df
    elif number_of_values_to_allocate > len(bins_for_extras):
        raise ValueError('There are more values to allocate than the list provided, please select more bins')
    # Number of records each bin should receive
    bins = {}
    for i in range(number_of_bins):
        number_of_values_in_bin = base_number
        if i in bins_for_extras:
            number_of_values_in_bin += 1
        bins[i] = number_of_values_in_bin
    df1 = df.copy()
    df1['rank'] = df1[value_series].rank()
    df1 = df1.sort_values(by=['rank'])
    # Walk the sorted rows and give each bin its slice of consecutive positions
    pieces = []
    row_to_start_allocate = 0
    for bin_number, number_in_bin in bins.items():
        row_to_end_allocate = row_to_start_allocate + number_in_bin
        index_slice = df1.index[row_to_start_allocate:row_to_end_allocate]
        pieces.append(pd.Series(bin_number, index=index_slice))
        row_to_start_allocate = row_to_end_allocate
    # Series.append was removed in pandas 2.0; pd.concat does the same job
    df1['bins'] = pd.concat(pieces)
    df1 = df1.reset_index()
    return df1
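It isn't pretty, but it gets the job done. You pass in the DataFrame, the name of the column holding the values, and an ordered list of the bins that should receive any extra values. I'd welcome any suggestions on how to improve this code.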
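One possible simplification, offered as a sketch and not as a drop-in replacement: compute the size of each bin up front, then label the sorted positions with np.repeat. The helper name qcut_with_extras is hypothetical (it is not part of pandas), and the sketch assumes the values contain no NaN and that splitting tied values across adjacent bins is acceptable, which differs from how pd.qcut treats ties.

```python
import numpy as np
import pandas as pd


def qcut_with_extras(values, number_of_bins, bins_for_extras):
    """Equal-size rank binning where the caller picks which bins get one extra record.

    Hypothetical helper for illustration; assumes no missing values.
    """
    values = pd.Series(values)
    base, extras = divmod(len(values), number_of_bins)
    # Only as many bins as there are leftover records can be enlarged.
    chosen = list(bins_for_extras)[:extras]
    if len(set(chosen)) < extras:
        raise ValueError(f"need {extras} distinct bins for the extra records")
    if any(b < 0 or b >= number_of_bins for b in chosen):
        raise ValueError("bin labels must lie in 0..number_of_bins - 1")
    # Size of each bin: the base size, plus one for every chosen bin.
    sizes = np.full(number_of_bins, base)
    sizes[np.asarray(chosen, dtype=int)] += 1
    # Label the sorted positions bin by bin, then map the labels back
    # to the original row order.
    order = np.argsort(values.to_numpy(), kind="stable")
    labels = np.empty(len(values), dtype=int)
    labels[order] = np.repeat(np.arange(number_of_bins), sizes)
    return pd.Series(labels, index=values.index, name="bins")


rng = np.random.default_rng(0)
ser = pd.Series(rng.random(153))
bins = qcut_with_extras(ser, 10, [9, 5, 2])
counts = bins.value_counts().sort_index()
```

With 153 records and 10 bins, counts should show 16 records in bins 2, 5 and 9 and 15 in the rest. Returning a Series aligned to the input index leaves it to the caller to attach the result to a DataFrame, so the input is never modified in place.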