Pandas DataFrame column cut - adding more bins more frequently

Time: 2017-01-19 09:01:04

Tags: python pandas numpy

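I am binning a quantitative variable (a price, for example), and I want the bins to be denser around the mean and sparser farther away from it.

I have seen that cut() can bin linearly and, thanks to numpy.logspace, logarithmically, but grouping around the mean does not seem to be supported, and my attempts so far have not worked and look inefficient.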
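
For reference, the two approaches mentioned in the question can be sketched as follows. The `prices` values are made up for illustration; only `pd.cut` and `np.logspace` come from the question itself.

```python
import numpy as np
import pandas as pd

# Hypothetical price data, used only for illustration.
prices = pd.Series([1.5, 2.5, 4.0, 9.0, 15.0, 40.0, 80.0, 90.0])

# Linear binning: passing an integer makes pd.cut use equal-width bins.
linear = pd.cut(prices, bins=4)

# Logarithmic binning: edges from np.logspace get wider as values grow.
log_edges = np.logspace(0, 2, num=5)  # 5 edges from 10**0 to 10**2
logarithmic = pd.cut(prices, bins=log_edges)

print(linear.value_counts(sort=False))
print(logarithmic.value_counts(sort=False))
```

Neither scheme concentrates bins around the mean: the first spaces them evenly, the second packs them toward the low end of the range.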

2 answers:

Answer 0: (score: 4)

You can make bins whose widths increase linearly away from the mean:

import numpy as np

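def make_progressive_bins(min_x, max_x, mean_x, num_bins=10):
    # Half-width of the binned range: the larger distance from the mean
    # to either end of the data.
    x_rel_lim = max(mean_x - min_x, max_x - mean_x)
    num_bins_half = num_bins // 2
    # Step sizes 0, 1, 2, ... grow linearly away from the mean.
    bins_right = np.arange(0, num_bins_half + 1)
    if num_bins % 2 == 1:
        # Odd count: shift by 0.5 so one bin straddles the mean.
        bins_right = bins_right + 0.5
    # The cumulative sum turns step sizes into edge offsets.
    bins_right = np.cumsum(bins_right)
    # Mirror the positive offsets to get the edges left of the mean.
    bins = np.concatenate([-bins_right[bins_right > 0][::-1], bins_right])
    # Rescale to the half-width and centre on the mean.
    bins = bins * (float(x_rel_lim) / bins[-1]) + mean_x
    return bins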

Then you can use it:

import numpy as np
import matplotlib.pyplot as plt

bins = make_progressive_bins(-20, 50, 10, 15)
plt.bar(bins - 0.1, np.ones_like(bins), 0.2)

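
Since the question is about a DataFrame column, here is a minimal sketch of passing such edges to `pd.cut`. The function is restated compactly so the block runs on its own, and it assumes the half-width should be the larger distance from the mean to either end of the data. The random data and the `price_bin` column name are hypothetical.

```python
import numpy as np
import pandas as pd

def make_progressive_bins(min_x, max_x, mean_x, num_bins=10):
    # Compact restatement of the answer's function, so the sketch runs alone.
    # Assumption: the half-width is the larger distance from the mean to either end.
    x_rel_lim = max(mean_x - min_x, max_x - mean_x)
    steps = np.arange(0, num_bins // 2 + 1) + (0.5 if num_bins % 2 else 0.0)
    right = np.cumsum(steps)
    edges = np.concatenate([-right[right > 0][::-1], right])
    return edges * (float(x_rel_lim) / edges[-1]) + mean_x

# Hypothetical data: 1000 normally distributed prices.
rng = np.random.default_rng(0)
df = pd.DataFrame({"price": rng.normal(50, 10, size=1000)})

bins = make_progressive_bins(df.price.min(), df.price.max(), df.price.mean(), 15)
widths = np.diff(bins)  # narrowest at the mean, widening outwards

df["price_bin"] = pd.cut(df.price, bins=bins, include_lowest=True)
print(widths.round(2))
print(df.price_bin.value_counts(sort=False))
```

Any value that lands outside the outermost edges comes back as NaN, so it may be worth checking `df.price_bin.isna().sum()` after cutting.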

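Answer 1: (score: 1)

I wrote a script that might do what you are after, but I am not sure how to turn the resulting cut object into a histogram to see whether it does what I intend, so please check it and let me know whether it works :).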

import numpy as np
import pandas as pd

# Make normally distributed price with mean 50.
df = pd.DataFrame(data=np.random.normal(50, size=1000), columns=['price'])
df.hist(bins=30)

num_bins = 100

# I used a square function to distribute the bins more around 0 and 
# less at the outskirts of the range.
shape_func = lambda x: x**2

bin_loc = [shape_func(i) for i in range(num_bins//2)]
mirrored_bin_loc = [-x for x in bin_loc[::-1]]
bin_loc = mirrored_bin_loc + bin_loc[1:]

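# Rescale and translate bins: centre them on the mean and stretch them
# to reach the data point farthest from it.
data_mean = df.price.mean()
half_width = max(df.price.max() - data_mean, data_mean - df.price.min())
final_bin_loc = [x * half_width / bin_loc[-1] + data_mean for x in bin_loc]

# display(final_bin_loc)
# Cut with the rescaled edges rather than the raw bin_loc.
binned = pd.cut(df.price, final_bin_loc, include_lowest=True)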
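
On the open point of turning the cut result into a histogram, one possible sketch is to count the categories with `value_counts`, or to hand the same edges to `np.histogram`. The data and the names `price`, `edges`, `counts_from_cut` and `counts_from_hist` are hypothetical, and the rescaling assumes the outermost edges should reach the point farthest from the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical data mirroring the answer: normal prices with mean 50.
rng = np.random.default_rng(0)
price = pd.Series(rng.normal(50, size=1000), name="price")

# Squared offsets mirrored around 0, then rescaled around the mean.
# Assumption: the outermost edges should reach the point farthest from the mean.
half = np.arange(50) ** 2
offsets = np.concatenate([-half[:0:-1], half])
half_width = max(price.max() - price.mean(), price.mean() - price.min())
edges = offsets / offsets[-1] * half_width + price.mean()

# Option 1: count the categories produced by pd.cut, in bin order.
binned = pd.cut(price, edges, include_lowest=True)
counts_from_cut = binned.value_counts(sort=False)

# Option 2: let np.histogram count directly with the same edges.
counts_from_hist, _ = np.histogram(price, bins=edges)

print(counts_from_cut.head())
print(counts_from_hist[:5])
```

To view the result, `plt.hist(price, bins=edges)` accepts the same edge array. Because the bins have unequal widths, raw counts are not directly comparable across bins; passing `density=True` to `np.histogram` normalizes by bin width.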