Assigning the remainder to specific bins in pandas.qcut()

Date: 2018-07-04 10:47:00

Tags: python pandas

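I am trying to replicate a specific method of assigning records to deciles, and pandas.qcut() does the job well. My only concern is that there does not seem to be a way to assign the uneven remainder to specific bins, as the method I am replicating requires.

Here is my example: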

import numpy as np
import pandas as pd

num = np.random.rand(153, 1)
my_list = [x[0] for x in num]  # flatten the (153, 1) array into a list
ser = pd.Series(my_list)
bins = pd.qcut(ser, 10, labels=False)
bins.value_counts()

Which outputs:

9    16
4    16
0    16
8    15
7    15
6    15
5    15
3    15
2    15
1    15

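Seven bins have 15 records and three have 16. What I would like to do is specify which bins receive the 16 records (marked with < here):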

9    16 <
4    16
0    16
8    15
7    15
6    15
5    15 <
3    15
2    15 <
1    15

Is this possible with pd.qcut?
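For reference, the 15/16 split in the output above follows from integer division of the record count by the number of bins. The short sketch below only illustrates that arithmetic; the variable names are chosen for illustration and are not part of pandas.

```python
number_of_records, number_of_bins = 153, 10

# 153 records over 10 bins: every bin gets the base size,
# and the remainder is handed out one record at a time.
base_size, remainder = divmod(number_of_records, number_of_bins)
print(base_size, remainder)  # 15 3

# Three bins of 16 and seven bins of 15 add up to 153 again.
total = remainder * (base_size + 1) + (number_of_bins - remainder) * base_size
print(total)  # 153
```

With pd.qcut the bins that end up with the extra record are a consequence of where the quantile edges fall, which is why the question asks for a way to choose them explicitly.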

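1 Answer:

Answer 0: (score: 0)

Since nobody answered, and the few people I asked thought it was unlikely to be possible, I cobbled together a function that does this: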

import pandas as pd


def defined_qcut(df, value_series, number_of_bins, bins_for_extras, labels=False):
    # Bins are numbered 0 .. number_of_bins - 1, so number_of_bins itself is out of range.
    if max(bins_for_extras) >= number_of_bins or any(x < 0 for x in bins_for_extras):
        raise ValueError("Attempted to allocate to a bin that doesn't exist")
    base_number, number_of_values_to_allocate = divmod(df[value_series].count(), number_of_bins)
    # Only the first `number_of_values_to_allocate` listed bins receive an extra value.
    bins_for_extras = bins_for_extras[:number_of_values_to_allocate]
    if number_of_values_to_allocate == 0:
        # No remainder: a plain qcut already yields equal-sized bins.
        df['bins'] = pd.qcut(df[value_series], number_of_bins, labels=labels)
        return df
    elif number_of_values_to_allocate > len(bins_for_extras):
        raise ValueError('There are more values to allocate than bins in the list provided, please select more bins')
    # Target size of each bin: the base size, plus one for the chosen bins.
    bins = {}
    for i in range(number_of_bins):
        number_of_values_in_bin = base_number
        if i in bins_for_extras:
            number_of_values_in_bin += 1
        bins[i] = number_of_values_in_bin
    # Sort a copy by rank so that each bin is a contiguous block of rows.
    df1 = df.copy()
    df1['rank'] = df1[value_series].rank()
    df1 = df1.sort_values(by=['rank'])
    # Turn the bin sizes into [size, start_row, end_row] positions.
    row_to_start_allocate = 0
    row_to_end_allocate = 0
    for bin_number, number_in_bin in bins.items():
        row_to_end_allocate += number_in_bin
        bins[bin_number] = [number_in_bin, row_to_start_allocate, row_to_end_allocate]
        row_to_start_allocate = row_to_end_allocate
    # Label each block of rows with its bin number. pd.concat is used because
    # Series.append was removed in pandas 2.0.
    pieces = [pd.Series(k, index=df1.index[v[1]:v[2]]) for k, v in bins.items()]
    df1['bins'] = pd.concat(pieces)
    df1 = df1.reset_index()
    return df1

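It isn't pretty, but it gets the job done. You pass in the dataframe, the name of the column holding the values, and an ordered list of the bins to which any extra values should be allocated. I would welcome any advice on how to improve this code.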
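As one possible simplification of the approach described above, the same idea (sort the values, then cut the sorted order into blocks of chosen sizes) can be sketched with np.repeat. This is only a sketch under stated assumptions: qcut_with_extras is a hypothetical helper name, the input is assumed to contain no missing values, and ties are broken by input order rather than by quantile edges, so it is not a drop-in replacement for pd.qcut.

```python
import numpy as np
import pandas as pd


def qcut_with_extras(values, number_of_bins, bins_for_extras):
    """Hypothetical helper: equal-sized bins, with the remainder going to chosen bins."""
    values = pd.Series(values)
    base, remainder = divmod(len(values), number_of_bins)
    # Only the first `remainder` listed bins receive an extra value.
    chosen = list(bins_for_extras)[:remainder]
    if len(chosen) < remainder:
        raise ValueError("not enough bins listed to absorb the remainder")
    if any(b < 0 or b >= number_of_bins for b in chosen):
        raise ValueError("bin index out of range")
    # Size of each bin: the base size, plus one for each chosen bin.
    sizes = np.full(number_of_bins, base)
    for b in chosen:
        sizes[b] += 1
    # Bin label for each position in sorted order: 0, 0, ..., 1, 1, ...
    labels_sorted = np.repeat(np.arange(number_of_bins), sizes)
    # Positions of the values in ascending order (stable, so ties keep input order).
    order = np.argsort(values.to_numpy(), kind="stable")
    labels = np.empty(len(values), dtype=int)
    labels[order] = labels_sorted
    return pd.Series(labels, index=values.index)


# Usage with the numbers from the question: 153 values, 10 bins,
# extras sent to bins 9, 5 and 2.
rng = np.random.default_rng(0)
ser = pd.Series(rng.random(153))
bins = qcut_with_extras(ser, 10, [9, 5, 2])
counts = bins.value_counts().sort_index()
print(counts)
```

Unlike the function above, this sketch returns only the bin labels, aligned to the input index, and leaves attaching them to a dataframe to the caller.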