I would like to classify data in R in a particular way. Suppose I have the following vector:
> data = sample(1:500, 5000, replace = TRUE)
To classify it, I create these classes:
> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500))
> table(data.cl)
data.cl
(0,10] (10,20] (20,30] (30,40] (40,50]
102 80 87 113 117
(50,60] (60,70] (70,80] (80,90] (90,100]
101 89 95 106 104
(100,200] (200,350] (350,480] (480,500]
1002 1492 1318 194
If I want 0 to be included, I just need to add include.lowest = TRUE:
> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500),
+ include.lowest = TRUE)
> table(data.cl)
data.cl
[0,10] (10,20] (20,30] (30,40] (40,50]
102 80 87 113 117
(50,60] (60,70] (70,80] (80,90] (90,100]
101 89 95 106 104
(100,200] (200,350] (350,480] (480,500]
1002 1492 1318 194
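In this example that makes no difference, because 0 does not occur in the data at all. But if it did, say 4 times, the class [0,10] would contain 106 elements instead of 102: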
> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500),
+ include.lowest = TRUE)
> table(data.cl)
data.cl
[0,10] (10,20] (20,30] (30,40] (40,50]
106 80 87 113 117
(50,60] (60,70] (70,80] (80,90] (90,100]
101 89 95 106 104
(100,200] (200,350] (350,480] (480,500]
1002 1492 1318 194
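To make the zero case easy to reproduce, here is a minimal sketch on a small hand-made vector. The names x, brks, open_low and closed_low and the values are made up for illustration; they are not the data above.

```r
# Hypothetical toy data: four zeros plus a few in-range values
x <- c(0, 0, 0, 0, 5, 10, 15)
brks <- c(0, 10, 20)

# Default: the first interval is (0,10], so the zeros fall outside and become NA
open_low <- cut(x, breaks = brks)
table(open_low, useNA = "ifany")

# include.lowest = TRUE closes the first interval to [0,10], so the zeros are counted
closed_low <- cut(x, breaks = brks, include.lowest = TRUE)
table(closed_low)
```

In this sketch the four zeros move from NA into [0,10], which is the same shift from 102 to 106 described above.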
There is another option for changing the class limits. The default for cut() is right = TRUE. If you change it to right = FALSE, you get:
> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500),
+ include.lowest = TRUE, right = FALSE)
> table(data.cl)
data.cl
[0,10) [10,20) [20,30) [30,40) [40,50)
92 81 87 111 118
[50,60) [60,70) [70,80) [80,90) [90,100)
103 89 94 103 103
[100,200) [200,350) [350,480) [480,500]
1003 1497 1320 199
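include.lowest now effectively becomes "include.highest", but at the cost of shifting the class limits: every interval is now closed on the left and open on the right, so some classes contain a different number of members.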
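A small sketch of that behaviour, using values that sit exactly on the breaks. The names x, brks, right_closed and left_closed are hypothetical and only serve this illustration.

```r
# Hypothetical values lying exactly on the break points
x <- c(0, 10, 20)
brks <- c(0, 10, 20)

# right = TRUE (the default): intervals are (a, b];
# include.lowest closes the lowest break, giving [0,10] and (10,20]
right_closed <- cut(x, breaks = brks, include.lowest = TRUE)
as.character(right_closed)   # "[0,10]" "[0,10]" "(10,20]"

# right = FALSE: intervals are [a, b);
# include.lowest now closes the highest break, giving [0,10) and [10,20]
left_closed <- cut(x, breaks = brks, include.lowest = TRUE, right = FALSE)
as.character(left_closed)    # "[0,10)" "[10,20]" "[10,20]"
```

The value 10 changes class between the two calls, which is why the counts in the tables above differ slightly.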
But what if I want the classes to look like this (note the open upper bound on the last interval),
> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500))
> table(data.cl)
data.cl
(0,10] (10,20] (20,30] (30,40] (40,50]
102 80 87 113 117
(50,60] (60,70] (70,80] (80,90] (90,100]
101 89 95 106 104
(100,200] (200,350] (350,480] (480,500)
1002 1492 1318 194
that is, with 500 excluded? What should I do?
Of course, one could say: "Just write data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 499)) instead of data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500)), because you are dealing with integers."
Well, that is true, but what if that is not the case and I am working with floats? How do I exclude 500 then?