Question

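Can you suggest a good function that bins highly skewed data into at most a desired number of bins? For example, I want to split every numeric variable in a DataFrame into 10 bins, but the data contains some highly skewed variables, such as a discrete variable with only 5 possible values, and such a variable should be split into 5 bins instead. I tried the cut function in pandas, but the results were not promising. Can you help me find a good function?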

Answer 1

If a particular column can only take certain values, you can determine those values with the Series unique() method, for example:

import pandas as pd
import matplotlib.pyplot  # importing the submodule makes matplotlib.pyplot available

data_series = pd.Series([0,1,2,2,2,1,1,1,0,0,0,0])
# unique() returns values in order of appearance; bin edges must be increasing, so sort them
unique_vals = sorted(data_series.unique())
if len(unique_vals) > 0.95*(len(data_series)):
    # almost all values are unique - plot a normal histogram
    matplotlib.pyplot.hist(data_series)
else:
    # many non-unique values - put each discrete value in its own bin
    # bins specifies the edges of the bins - need an extra edge to create a bin for the maximal value
    bins = unique_vals + [max(unique_vals)+1]
    # hist returns the counts, the bin edges and the drawn patches
    counts, edges, patches = matplotlib.pyplot.hist(data_series,bins=bins)

If your discrete values are very unevenly spaced, this will produce some odd-looking histograms.
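
To make the uneven-spacing problem concrete, here is a small sketch that builds the bin edges the same way as above and prints the resulting bin widths. The sample values are made up for illustration, and np.histogram is used only to count without drawing anything.

```python
import numpy as np
import pandas as pd

# Hypothetical data: discrete values with very uneven spacing
data_series = pd.Series([0, 0, 1, 10, 10, 100])

# Same edge construction as above: sorted unique values plus one extra edge
unique_vals = sorted(data_series.unique())
edges = unique_vals + [max(unique_vals) + 1]

counts, _ = np.histogram(data_series, bins=edges)
counts = counts.tolist()            # observations per bin
widths = np.diff(edges).tolist()    # width of each bin on the x-axis

print(widths)
print(counts)
```

Each bin stretches from one value to the next, so a bar's width reflects the gap between values rather than anything about the data in it.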

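A more natural way to plot the discrete case may be a bar chart, which you can build from value_counts (you may need to adjust the bar width, depending on how close together the discrete values are):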

matplotlib.pyplot.bar(data_series.value_counts().index,data_series.value_counts())
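
For the binning itself (as opposed to plotting), one possible approach, which is not part of the answer above, is to give each distinct value its own bin when a column has few values and to fall back to quantile bins otherwise. The sketch below assumes that is the behaviour you want; bin_numeric_columns and the sample DataFrame are hypothetical names and data, while pd.qcut with duplicates="drop" is the existing pandas call that merges repeated quantile edges.

```python
import pandas as pd

def bin_numeric_columns(df, max_bins=10):
    """Bin each numeric column into at most max_bins bins (hypothetical helper)."""
    binned = pd.DataFrame(index=df.index)
    for col in df.select_dtypes(include="number").columns:
        s = df[col]
        if s.nunique() <= max_bins:
            # Few distinct values: each value becomes its own bin
            binned[col] = s.astype("category")
        else:
            # Quantile bins; duplicates="drop" merges repeated edges,
            # so a skewed column may end up with fewer than max_bins bins
            binned[col] = pd.qcut(s, q=max_bins, duplicates="drop")
    return binned

# Made-up example: one discrete column and one heavily skewed column
df = pd.DataFrame({
    "discrete": [0, 1, 2, 3, 4] * 20,
    "skewed": [0] * 70 + list(range(1, 31)),
})
result = bin_numeric_columns(df, max_bins=10)
n_bins = {col: result[col].cat.categories.size for col in result.columns}
print(n_bins)
```

Quantile bins are only one choice here. If equal-width bins suit your data better, pd.cut could be swapped in for the second branch, though it does not reduce the bin count for skewed columns.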

Good binning function for highly skewed numerical variables in a pandas DataFrame
