Upper interval limit for the function 'cut'

Date: 2013-09-08 17:56:10

Tags: r floating-point double

I want to classify some data in R in a particular way. Suppose I have a vector like this:

> data = sample(1:500, 5000, replace = TRUE)

To classify this data, I create these classes:

> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500))
> table(data.cl)
data.cl
   (0,10]   (10,20]   (20,30]   (30,40]   (40,50] 
      102        80        87       113       117 
  (50,60]   (60,70]   (70,80]   (80,90]  (90,100] 
      101        89        95       106       104 
(100,200] (200,350] (350,480] (480,500] 
     1002      1492      1318       194 

If I want to include 0, I only need to add include.lowest = TRUE:

> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500),
+ include.lowest = TRUE)
> table(data.cl)
data.cl
   [0,10]   (10,20]   (20,30]   (30,40]   (40,50] 
      102        80        87       113       117 
  (50,60]   (60,70]   (70,80]   (80,90]  (90,100] 
      101        89        95       106       104 
(100,200] (200,350] (350,480] (480,500] 
     1002      1492      1318       194 

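In this example that makes no difference, because 0 does not occur in the data at all. But if it did, say 4 times, the class [0,10] would contain 106 elements instead of 102: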

> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500),
+ include.lowest = TRUE)
> table(data.cl)
data.cl
   [0,10]   (10,20]   (20,30]   (30,40]   (40,50] 
      106        80        87       113       117 
  (50,60]   (60,70]   (70,80]   (80,90]  (90,100] 
      101        89        95       106       104 
(100,200] (200,350] (350,480] (480,500] 
     1002      1492      1318       194 
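
The effect of include.lowest is easier to see on a tiny vector that really does contain 0. The following is only a minimal sketch; the values in x and the variable names are made up for illustration and are not part of the data above.

```r
# Hypothetical hand-made values, two of which sit exactly on the lowest break
x <- c(0, 0, 5, 10, 11, 20)
breaks <- c(0, 10, 20)

# Default: intervals are (0,10] and (10,20], so 0 falls outside and becomes NA
cl_default <- cut(x, breaks = breaks)
table(cl_default, useNA = "ifany")

# include.lowest = TRUE closes the first interval on the left: [0,10]
cl_lowest <- cut(x, breaks = breaks, include.lowest = TRUE)
table(cl_lowest)
```

Without include.lowest the two zeros are counted as NA; with it they are counted in the first class.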

There is another option for changing the class limits. The default for cut() is right = TRUE. If you change it to right = FALSE, you get:

> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500),
+ include.lowest = TRUE, right = FALSE)
> table(data.cl)
data.cl
   [0,10)   [10,20)   [20,30)   [30,40)   [40,50) 
       92        81        87       111       118 
  [50,60)   [60,70)   [70,80)   [80,90)  [90,100) 
      103        89        94       103       103 
[100,200) [200,350) [350,480) [480,500] 
     1003      1497      1320       199 

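include.lowest now acts as an "include.highest", but at the cost of changed class limits: some classes now contain a different number of members, because the interval boundaries have shifted slightly.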
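
To make the shift concrete, here is a small sketch comparing both settings on the same values. The vector x and the names cl_right and cl_left are hypothetical and chosen only for this illustration.

```r
# Hypothetical values, including points that sit exactly on a break
x <- c(0, 5, 10, 15, 20)
breaks <- c(0, 10, 20)

# right = TRUE (the default) with include.lowest: classes [0,10] and (10,20]
cl_right <- cut(x, breaks = breaks, include.lowest = TRUE)

# right = FALSE with include.lowest: classes [0,10) and [10,20]
cl_left <- cut(x, breaks = breaks, include.lowest = TRUE, right = FALSE)

# Side-by-side view of the class each value lands in
data.frame(x, cl_right, cl_left)
```

Only the value 10, which lies exactly on an inner break, moves to a different class.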
But what if I want the classes to look like this:

> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500))
> table(data.cl)
data.cl
   (0,10]   (10,20]   (20,30]   (30,40]   (40,50] 
      102        80        87       113       117 
  (50,60]   (60,70]   (70,80]   (80,90]  (90,100] 
      101        89        95       106       104 
(100,200] (200,350] (350,480] (480,500) 
     1002      1492      1318       194

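that is, with 500 excluded? What should I do then? Of course, one could say: "Just write data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 499)) instead of data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500)), because you are dealing with integers."
Well, that is true, but what if that is not the case and I am working with floats? How do I exclude 500 then?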
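
The sketch below illustrates why the 499 trick breaks down for non-integer data, and shows one possible workaround. It is not a settled answer: the values in x are hypothetical, and the exact comparison x == 500 is an assumption that may need a tolerance if the values come out of floating-point arithmetic.

```r
# Hypothetical non-integer values near the upper limit
x <- c(479.5, 480.5, 499.2, 499.99, 500)

# The integer trick: with 499 as the last break, 499.2 and 499.99 are lost too
cl_499 <- cut(x, breaks = c(0, 480, 499))

# Possible workaround: keep 500 as the break, then blank out exact hits
cl <- cut(x, breaks = c(0, 480, 500))   # classes (0,480] and (480,500]
cl[x == 500] <- NA                      # assumption: exact equality is adequate here

# Relabel the last level so the printed interval matches its contents
last <- nlevels(cl)
levels(cl)[last] <- sub("\\]$", ")", levels(cl)[last])

table(cl_499, useNA = "ifany")
table(cl, useNA = "ifany")
```

Note that cut() itself still treats the last interval as closed on the right; only the label and the values sitting exactly on 500 are changed after the fact.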

0 answers:

No answers