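I am trying to bin data as follows:
pd.cut(df['col'], np.arange(0, 1.2, 0.2), include_lowest=True)
However, I want to make sure that any value greater than 1 is also included in the last bin. I can do this in a few lines, but I wonder whether anyone knows a one-liner or a more Pythonic way to do it?
PS: I don't want qcut. I need the bins to be split by value, not by the number of records.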
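For context, here is a minimal sketch of the behaviour in question. The sample values are made up for illustration; the point is that pd.cut returns NaN for any value outside the outermost bin edges, so values above 1 are dropped from the last bin.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the last value exceeds the top edge of 1.0.
s = pd.Series([0.15, 0.62, 0.98, 1.49])

# Edges 0, 0.2, ..., 1.0, exactly as in the question.
binned = pd.cut(s, np.arange(0, 1.2, 0.2), include_lowest=True)

# Values outside the edges are not assigned to any bin and come back as NaN.
out_of_range = binned.isna().tolist()
print(out_of_range)  # [False, False, False, True]
```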
Answer 0 (score: 2)
Solution 1: prepare labels (using the first 5 rows of the DF) and replace 1 with np.inf in the bins argument:
In [67]: df
Out[67]:
a b c
0 1.698479 0.337989 0.002482
1 0.903344 1.830499 0.095253
2 0.152001 0.439870 0.270818
3 0.621822 0.124322 0.471747
4 0.534484 0.051634 0.854997
5 0.980915 1.065050 0.211227
6 0.809973 0.894893 0.093497
7 0.677761 0.333985 0.349353
8 1.491537 0.622429 1.456846
9 0.294025 1.286364 0.384152
In [68]: labels = pd.cut(df.a.head(), np.arange(0,1.2, 0.2), include_lowest=True).cat.categories
In [69]: pd.cut(df.a, np.append(np.arange(0, 1, 0.2), np.inf), labels=labels, include_lowest=True)
Out[69]:
0 (0.8, 1]
1 (0.8, 1]
2 [0, 0.2]
3 (0.6, 0.8]
4 (0.4, 0.6]
5 (0.8, 1]
6 (0.8, 1]
7 (0.6, 0.8]
8 (0.8, 1]
9 (0.2, 0.4]
Name: a, dtype: category
Categories (5, object): [[0, 0.2] < (0.2, 0.4] < (0.4, 0.6] < (0.6, 0.8] < (0.8, 1]]
Explanation:
In [72]: np.append(np.arange(0, 1, 0.2), np.inf)
Out[72]: array([ 0. , 0.2, 0.4, 0.6, 0.8, inf])
In [73]: labels
Out[73]: Index(['[0, 0.2]', '(0.2, 0.4]', '(0.4, 0.6]', '(0.6, 0.8]', '(0.8, 1]'], dtype='object')
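Solution 1 can be sketched as a self-contained script. This is a sketch under stated assumptions rather than the exact session above: the sample values are the first five rows of column a, and the labels are written out by hand (an assumption made here) because recent pandas versions return Interval objects instead of strings for the categories.

```python
import numpy as np
import pandas as pd

# First five values of column a from the example above.
s = pd.Series([1.698479, 0.903344, 0.152001, 0.621822, 0.534484])

# Finite edges 0, 0.2, ..., 0.8 followed by an open-ended upper edge.
edges = np.append(np.arange(0, 1, 0.2), np.inf)

# Hand-written labels (an assumption for this sketch). The last label is
# only a display name: the underlying bin is really (0.8, inf].
labels = ["[0, 0.2]", "(0.2, 0.4]", "(0.4, 0.6]", "(0.6, 0.8]", "(0.8, 1]"]

binned = pd.cut(s, bins=edges, labels=labels, include_lowest=True)
print(binned)
```

Because the last bin is unbounded above, no value is left unassigned; a label such as "> 0.8" might describe that bin more honestly if the display text matters.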
Solution 2: clip all values greater than 1 down to 1:
In [70]: pd.cut(df.a.clip(upper=1), np.arange(0,1.2, 0.2),include_lowest=True)
Out[70]:
0 (0.8, 1]
1 (0.8, 1]
2 [0, 0.2]
3 (0.6, 0.8]
4 (0.4, 0.6]
5 (0.8, 1]
6 (0.8, 1]
7 (0.6, 0.8]
8 (0.8, 1]
9 (0.2, 0.4]
Name: a, dtype: category
Categories (5, object): [[0, 0.2] < (0.2, 0.4] < (0.4, 0.6] < (0.6, 0.8] < (0.8, 1]]
Explanation:
In [75]: df.a
Out[75]:
0 1.698479
1 0.903344
2 0.152001
3 0.621822
4 0.534484
5 0.980915
6 0.809973
7 0.677761
8 1.491537
9 0.294025
Name: a, dtype: float64
In [76]: df.a.clip(upper=1)
Out[76]:
0 1.000000
1 0.903344
2 0.152001
3 0.621822
4 0.534484
5 0.980915
6 0.809973
7 0.677761
8 1.000000
9 0.294025
Name: a, dtype: float64
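The clip approach can likewise be sketched as a standalone script. The sample values are a subset of column a chosen for illustration. The design choice is that the bin edges stay untouched and only the data is capped, so the category intervals still read as 0 to 1.

```python
import numpy as np
import pandas as pd

# A few values of column a from the example above; two of them exceed 1.
s = pd.Series([1.698479, 0.903344, 0.152001, 1.491537, 0.294025])

# Edges 0, 0.2, ..., 1.0, unchanged from the question.
edges = np.arange(0, 1.2, 0.2)

# Cap values at 1 so they fall on the top edge, which the right-closed
# last bin includes.
binned = pd.cut(s.clip(upper=1), edges, include_lowest=True)

# Integer position of the bin each value landed in (0 = lowest bin).
codes = binned.cat.codes.tolist()
print(codes)  # [4, 4, 0, 4, 1]
```

Note that clip returns a new Series and leaves the original data unchanged, so the capped values exist only for the purpose of binning.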