Binning by value, except for the last bin

Time: 2016-08-17 22:23:43

Tags: python pandas dataframe categorical-data binning

I am trying to bin data as follows:

pd.cut(df['col'], np.arange(0, 1.2, 0.2), include_lowest=True)

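But I want to make sure that any value greater than 1 also ends up in the last bin. I can do this in a few lines, but I'm wondering whether anyone knows a one-liner or a more Pythonic way to do it?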

PS - I'm not looking to do a qcut; I need the bins to be separated by their values, not by the number of records.
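
To make the problem concrete, here is a minimal sketch of what happens with the call above. The sample values are made up for illustration (they are not the asker's data), and `np.linspace` is used in place of `np.arange` only to get the same six edges without relying on a floating-point stop value. Anything above the last edge falls outside every bin and comes back as NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical sample values; two of them exceed the last bin edge (1.0)
values = pd.Series([0.15, 0.62, 0.98, 1.49, 1.70])

# Six edges: 0.0, 0.2, 0.4, 0.6, 0.8, 1.0
edges = np.linspace(0, 1, 6)

binned = pd.cut(values, edges, include_lowest=True)

# Values outside the outermost edges are not assigned to any bin
n_unbinned = int(binned.isna().sum())
print(binned)
print("values left without a bin:", n_unbinned)
```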

1 Answer:

Answer 0: (score: 2)

Solution 1: prepare the labels (using the first 5 rows of the DF) and replace 1 with np.inf in the bins argument

In [67]: df
Out[67]:
          a         b         c
0  1.698479  0.337989  0.002482
1  0.903344  1.830499  0.095253
2  0.152001  0.439870  0.270818
3  0.621822  0.124322  0.471747
4  0.534484  0.051634  0.854997
5  0.980915  1.065050  0.211227
6  0.809973  0.894893  0.093497
7  0.677761  0.333985  0.349353
8  1.491537  0.622429  1.456846
9  0.294025  1.286364  0.384152

In [68]: labels = pd.cut(df.a.head(), np.arange(0,1.2, 0.2), include_lowest=True).cat.categories

In [69]: pd.cut(df.a, np.append(np.arange(0, 1, 0.2), np.inf), labels=labels, include_lowest=True)
Out[69]:
0      (0.8, 1]
1      (0.8, 1]
2      [0, 0.2]
3    (0.6, 0.8]
4    (0.4, 0.6]
5      (0.8, 1]
6      (0.8, 1]
7    (0.6, 0.8]
8      (0.8, 1]
9    (0.2, 0.4]
Name: a, dtype: category
Categories (5, object): [[0, 0.2] < (0.2, 0.4] < (0.4, 0.6] < (0.6, 0.8] < (0.8, 1]]

Explanation:

In [72]: np.append(np.arange(0, 1, 0.2), np.inf)
Out[72]: array([ 0. ,  0.2,  0.4,  0.6,  0.8,  inf])

In [73]: labels
Out[73]: Index(['[0, 0.2]', '(0.2, 0.4]', '(0.4, 0.6]', '(0.6, 0.8]', '(0.8, 1]'], dtype='object')
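
As a self-contained sketch of the same idea, the snippet below writes the labels out by hand instead of deriving them from a first `pd.cut` call. This is an assumption made for brevity, as is the use of `np.linspace` for the interior edges; the sample values are the first five entries of column `a` shown above. The last edge is `np.inf`, so the final bin has no upper limit while keeping the `(0.8, 1]` label:

```python
import numpy as np
import pandas as pd

# First five values of column `a` from the example frame
values = pd.Series([1.698479, 0.903344, 0.152001, 0.621822, 0.534484])

# Interior edges 0.0 .. 0.8, then an open-ended upper edge
edges = np.append(np.linspace(0, 0.8, 5), np.inf)

# Hand-written labels (one per bin); the last one really means "above 0.8"
labels = ['[0, 0.2]', '(0.2, 0.4]', '(0.4, 0.6]', '(0.6, 0.8]', '(0.8, 1]']

binned = pd.cut(values, edges, labels=labels, include_lowest=True)
result = binned.astype(str).tolist()
print(result)
```

Since the last label no longer matches the actual bin range, a label such as `'> 0.8'` may be less misleading if the output is shown to other people.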

Solution 2: clip all values greater than 1

In [70]: pd.cut(df.a.clip(upper=1), np.arange(0,1.2, 0.2),include_lowest=True)
Out[70]:
0      (0.8, 1]
1      (0.8, 1]
2      [0, 0.2]
3    (0.6, 0.8]
4    (0.4, 0.6]
5      (0.8, 1]
6      (0.8, 1]
7    (0.6, 0.8]
8      (0.8, 1]
9    (0.2, 0.4]
Name: a, dtype: category
Categories (5, object): [[0, 0.2] < (0.2, 0.4] < (0.4, 0.6] < (0.6, 0.8] < (0.8, 1]]

Explanation:

In [75]: df.a
Out[75]:
0    1.698479
1    0.903344
2    0.152001
3    0.621822
4    0.534484
5    0.980915
6    0.809973
7    0.677761
8    1.491537
9    0.294025
Name: a, dtype: float64

In [76]: df.a.clip(upper=1)
Out[76]:
0    1.000000
1    0.903344
2    0.152001
3    0.621822
4    0.534484
5    0.980915
6    0.809973
7    0.677761
8    1.000000
9    0.294025
Name: a, dtype: float64
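
The clipping approach can also be tried as a standalone sketch. The sample values below are picked from column `a` above, and `np.linspace` replaces `np.arange` (an assumption, chosen so the last edge is exactly 1.0). After `clip(upper=1)`, every value above 1 sits exactly on the last edge, which belongs to the right-closed final bin:

```python
import numpy as np
import pandas as pd

# A few values from column `a`, including the two that exceed 1
values = pd.Series([1.698479, 0.903344, 0.152001, 1.491537, 0.294025])

# Six edges: 0.0, 0.2, 0.4, 0.6, 0.8, 1.0
edges = np.linspace(0, 1, 6)

# Cap everything above 1 at exactly 1, then bin as usual
clipped = values.clip(upper=1)
binned = pd.cut(clipped, edges, include_lowest=True)

# Integer bin positions: 0 is the first bin, 4 is the last
codes = binned.cat.codes.tolist()
print(codes)
```

Unlike Solution 1, this keeps the bin labels accurate for the clipped data, at the cost of binning a modified copy of the values rather than the originals.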