Question

Here is a snippet:

import pandas as pd

test = pd.DataFrame({'days': [0,31,45]})
test['range'] = pd.cut(test.days, [0,30,60])

Output:

    days    range
0   0       NaN
1   31      (30, 60]
2   45      (30, 60]

I was surprised that 0 is not in (0, 30]. What should I do to have 0 categorized as (0, 30]?

Answer 1

test['range'] = pd.cut(test.days, [0,30,60], include_lowest=True)
print (test)
   days           range
0     0  (-0.001, 30.0]
1    31    (30.0, 60.0]
2    45    (30.0, 60.0]
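
The (-0.001, 30.0] label appears because include_lowest=True nudges the first edge slightly below 0 so that 0 falls inside the bin. If the days are known to be non-negative integers, one alternative is to start the bins at -1 and supply explicit labels. This is a sketch under that assumption, not part of the answer above; the label strings are made up for illustration.

```python
import pandas as pd

test = pd.DataFrame({'days': [0, 31, 45]})

# Assumption: days are non-negative integers, so (-1, 30] covers 0..30.
# The label strings are illustrative only.
test['range'] = pd.cut(test.days, [-1, 30, 60], labels=['0-30', '31-60'])
print(test)
```

Explicit labels hide the interval notation, so the artificial -1 edge never shows up in the output.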

See the difference:

test = pd.DataFrame({'days': [0,20,30,31,45,60]})

test['range1'] = pd.cut(test.days, [0,30,60], include_lowest=True)
#30 value is in [30, 60) group
test['range2'] = pd.cut(test.days, [0,30,60], right=False)
#30 value is in (0, 30] group
test['range3'] = pd.cut(test.days, [0,30,60])
print (test)
   days          range1    range2    range3
0     0  (-0.001, 30.0]   [0, 30)       NaN
1    20  (-0.001, 30.0]   [0, 30)   (0, 30]
2    30  (-0.001, 30.0]  [30, 60)   (0, 30]
3    31    (30.0, 60.0]  [30, 60)  (30, 60]
4    45    (30.0, 60.0]  [30, 60)  (30, 60]
5    60    (30.0, 60.0]       NaN  (30, 60]

Or use numpy.searchsorted. The array being searched (here the bin edges in arr) must be sorted:

arr = np.array([0,30,60])
test['range1'] = arr.searchsorted(test.days)
test['range2'] = arr.searchsorted(test.days, side='right') - 1
print (test)
   days  range1  range2
0     0       0       0
1    20       1       0
2    30       1       1
3    31       2       1
4    45       2       1
5    60       2       2
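
For numpy.searchsorted it is the array being searched (the bin edges) that has to be sorted; the values being looked up can come in any order. A small check of that, using deliberately unsorted days (the variable names here are illustrative):

```python
import numpy as np
import pandas as pd

edges = np.array([0, 30, 60])          # must be sorted ascending
days = pd.Series([45, 0, 60, 30, 20])  # deliberately unsorted

# side='right' minus 1 gives the index of the last edge <= value,
# i.e. left-closed bins [0, 30), [30, 60), [60, ...)
idx = edges.searchsorted(days.to_numpy(), side='right') - 1
print(idx.tolist())
```

Unlike pd.cut, this returns plain integer bin positions rather than interval labels, and values beyond the last edge get an index instead of NaN.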

Answer 2

See the pd.cut documentation.
Include the parameter:

right=False
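
A short sketch of what right=False does at the edges. Bins become closed on the left, so 0 is included, but a value equal to the last edge then falls outside. Adding np.inf as a top edge is one possible workaround and is an assumption here, not something the answer suggests.

```python
import numpy as np
import pandas as pd

days = pd.Series([0, 30, 60])

# right=False closes bins on the left: [0, 30), [30, 60)
left_closed = pd.cut(days, [0, 30, 60], right=False)
print(left_closed)  # 60 is outside [30, 60) and becomes NaN

# Assumption: an open-ended last bin is acceptable, so add np.inf as the top edge
open_ended = pd.cut(days, [0, 30, 60, np.inf], right=False)
print(open_ended)
```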

Answer 3

An example of how .cut works:

s = pd.Series([168,180,174,190,170,185,179,181,175,169,182,177,180,171])
pd.cut(s, 3)
# To add labels to the bins
pd.cut(s, 3, labels=["Small","Medium","Large"])

It can also be used directly on a range.
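
To see which edges pd.cut picks when given a bin count, the retbins=True option returns them alongside the result. The following is a sketch using the same series as above:

```python
import pandas as pd

s = pd.Series([168, 180, 174, 190, 170, 185, 179, 181, 175, 169, 182, 177, 180, 171])

# retbins=True also returns the computed edges of the 3 equal-width bins
cats, edges = pd.cut(s, 3, labels=["Small", "Medium", "Large"], retbins=True)
print(edges)
print(cats.value_counts(sort=False))
```

The bins have equal width, not equal counts; pd.qcut is the function to reach for when each bin should hold roughly the same number of values.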

Answer 4

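You can also use labels with pd.cut(). The example below contains student grades between 0 and 10. We add a new column named "grade_cat" to categorize the grades.

bins defines the interval edges: (0, 4] is one interval, (4, 6] is the next, and so on. The corresponding labels are "poor", "normal", etc.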

bins = [0, 4, 6, 10]
labels = ["poor","normal","excellent"]
student['grade_cat'] = pd.cut(student['grade'], bins=bins, labels=labels)
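
The student frame is not shown in the answer, so here is a runnable version with hypothetical grades. Because bins are right-closed by default, a grade of exactly 0 would become NaN; include_lowest=True is added here (an addition to the answer's code) to keep it in the first bin.

```python
import pandas as pd

# Hypothetical data; the original answer does not show the student frame
student = pd.DataFrame({'grade': [0, 3, 4, 5, 6, 9, 10]})

bins = [0, 4, 6, 10]
labels = ["poor", "normal", "excellent"]

# include_lowest=True keeps a grade of exactly 0 in the first bin
student['grade_cat'] = pd.cut(student['grade'], bins=bins, labels=labels,
                              include_lowest=True)
print(student)
```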

Answer 5

@jezrael has explained almost all the use cases of pd.cut().

One use case I would like to add is the following:

pd.cut(np.array([1,2,3,4,5,6]),3)

The number of bins is determined by the second argument, so we get the following output:

[(0.995, 2.667], (0.995, 2.667], (2.667, 4.333], (2.667, 4.333], (4.333, 6.0], (4.333, 6.0]]
Categories (3, interval[float64]): [(0.995, 2.667] < (2.667, 4.333] < (4.333, 6.0]]

Similarly, if we pass 2 as the number of bins (the second argument), the output is:

[(0.995, 3.5], (0.995, 3.5], (0.995, 3.5], (3.5, 6.0], (3.5, 6.0], (3.5, 6.0]]
Categories (2, interval[float64]): [(0.995, 3.5] < (3.5, 6.0]]
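
The 0.995 lower edge in both outputs is consistent with pandas splitting [min, max] into equal-width bins and then lowering the first edge by 0.1% of the range so that the minimum is included. The sketch below reproduces the edges by hand; the adjustment formula is inferred from the output above and the documented behavior, so treat it as an assumption about the current implementation.

```python
import numpy as np
import pandas as pd

x = np.array([1, 2, 3, 4, 5, 6])
_, edges = pd.cut(x, 3, retbins=True)

# Equal-width edges over [min, max]: 3 bins need 4 edges
expected = np.linspace(x.min(), x.max(), 4)
# Assumed adjustment: lower the first edge by 0.1% of the range
expected[0] -= (x.max() - x.min()) * 0.001

print(edges)
print(expected)
```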
