How can I categorize data in a simple way?

Time: 2018-12-24 17:36:05

Tags: r categories cut

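I want to categorize birth years in a simple way. I tried cut(), and the result already looks quite good, but it does not solve the problem completely.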

Given two samples of birth years:

set.seed(42)
s.even <- sample(2000:2015, 100, replace=TRUE)
s.odd <- sample(1998:2017, 100, replace=TRUE)

With the "even" sample, the output is fine:

df.even <- data.frame(birthyear=s.even, 
                      category=cut(s.even, 3,
                                   labels=c("youth", "young", "youngsters")))

> with(df.even, ftable(category, birthyear))
           birthyear 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
category                                                                                            
youth                   8    4    5    7    5    4    0    0    0    0    0    0    0    0    0    0
young                   0    0    0    0    0    0    7    5    6    6    8    0    0    0    0    0
youngsters              0    0    0    0    0    0    0    0    0    0    0    9    4    5    9    8

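But with the "odd" sample, the breaks are not placed where I want them, i.e. I would like the first category to contain 1998:2005 and the second to contain 2006:2010: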

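# s.odd.s is not defined anywhere in the post; assumed here to be the sorted
# sample, since head(s.odd.s) further down prints sorted values
s.odd.s <- sort(s.odd)
df.odd <- data.frame(birthyear=s.odd.s, 
                     category=cut(s.odd.s, 3,
                                  labels=c("youth", "young", "youngsters")))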

> with(df.odd, ftable(category, birthyear))
           birthyear 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
category                                                                                                                
youth                   3    3   10    6    3    3    3    0    0    0    0    0    0    0    0    0    0    0    0    0
young                   0    0    0    0    0    0    0    5    4    4    5    5    7    0    0    0    0    0    0    0
youngsters              0    0    0    0    0    0    0    0    0    0    0    0    0    2   11    9    5    2    8    2
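
The placement of the breaks in this table is consistent with how cut() handles a plain bin count: it splits the range of the data into intervals of equal width (and widens the outer ends slightly), rather than aligning them to particular years. A minimal sketch of that arithmetic, using the full span 1998:2017 instead of the random sample (the names x, inner and by.width are only illustrative):

```r
# Reproduce the inner breaks that cut(x, 3) derives from the data range.
x <- 1998:2017                                 # the years the "odd" sample can contain
inner <- seq(min(x), max(x), length.out = 4)   # 3 equal-width bins need 4 break points
print(inner)                                   # the two inner breaks are not whole years

# Equal-width bins therefore group 1998:2004, 2005:2010 and 2011:2017.
by.width <- cut(x, 3, labels = c("youth", "young", "youngsters"))
print(tapply(x, by.width, range))
```

With 16 years (2000:2015) the equal-width breaks happen to fall on 2005 and 2010, which would explain why the "even" sample looks right.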

So I tried setting the breaks manually like this:

> cut(s.odd.s, s.odd.s[c(1, 
+                        which(s.odd.s %% 5 == 0 & !duplicated(s.odd.s)), 
+                        length(s.odd.s))])
  [1] <NA>        <NA>        <NA>        (1998,2000] (1998,2000] (1998,2000] (1998,2000]
  [8] (1998,2000] (1998,2000] (1998,2000] (1998,2000] (1998,2000] (1998,2000] (1998,2000]
 [15] (1998,2000] (1998,2000] (2000,2005] (2000,2005] (2000,2005] (2000,2005] (2000,2005]
 [22] (2000,2005] (2000,2005] (2000,2005] (2000,2005] (2000,2005] (2000,2005] (2000,2005]
 [29] (2000,2005] (2000,2005] (2000,2005] (2000,2005] (2000,2005] (2000,2005] (2000,2005]
 [36] (2000,2005] (2005,2010] (2005,2010] (2005,2010] (2005,2010] (2005,2010] (2005,2010]
 [43] (2005,2010] (2005,2010] (2005,2010] (2005,2010] (2005,2010] (2005,2010] (2005,2010]
 [50] (2005,2010] (2005,2010] (2005,2010] (2005,2010] (2005,2010] (2005,2010] (2005,2010]
 [57] (2005,2010] (2005,2010] (2005,2010] (2005,2010] (2005,2010] (2010,2015] (2010,2015]
 [64] (2010,2015] (2010,2015] (2010,2015] (2010,2015] (2010,2015] (2010,2015] (2010,2015]
 [71] (2010,2015] (2010,2015] (2010,2015] (2010,2015] (2010,2015] (2010,2015] (2010,2015]
 [78] (2010,2015] (2010,2015] (2010,2015] (2010,2015] (2010,2015] (2010,2015] (2010,2015]
 [85] (2010,2015] (2010,2015] (2010,2015] (2010,2015] (2010,2015] (2010,2015] (2015,2017]
 [92] (2015,2017] (2015,2017] (2015,2017] (2015,2017] (2015,2017] (2015,2017] (2015,2017]
 [99] (2015,2017] (2015,2017]
Levels: (1998,2000] (2000,2005] (2005,2010] (2010,2015] (2015,2017]

But somehow 1998 is excluded:

> head(s.odd.s)
[1] 1998 1998 1998 1999 1999 1999
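
The missing 1998 is consistent with the default interval convention of cut(): intervals are open on the left and closed on the right, so a value equal to the lowest break lies outside the first interval and becomes NA. A minimal sketch with a hypothetical three-value vector (not the original sample); include.lowest is a documented argument of cut() that closes the first interval on the left:

```r
yrs <- c(1998, 1999, 2000)   # hypothetical toy input, not the original sample
brk <- c(1998, 2000, 2005)

default.cut <- cut(yrs, brk)                         # (1998,2000] and (2000,2005]
closed.cut  <- cut(yrs, brk, include.lowest = TRUE)  # first interval closed at 1998

print(is.na(default.cut))    # only 1998 is NA
print(is.na(closed.cut))     # no NA left
```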

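Anyway, maybe I am missing an option that can be set in cut()? I would also like to be able to start these three categories at arbitrary "even" break points, i.e. 1998:2004, 2005:2009, 2010:2017.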
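
One possible way to get arbitrary category boundaries is to pass explicit break points to cut() instead of a bin count. The sketch below assumes whole-number years and uses right = FALSE, so each interval includes its lower bound and excludes its upper bound; the last break, 2018, is an assumption chosen to lie just above the largest year, and the names yrs, brk and category are only illustrative.

```r
yrs <- 1998:2017                       # every year the "odd" sample can contain
brk <- c(1998, 2005, 2010, 2018)       # [1998,2005), [2005,2010), [2010,2018)

category <- cut(yrs, breaks = brk, right = FALSE,
                labels = c("youth", "young", "youngsters"))

print(table(category))                 # 7, 5 and 8 years per category
```

The same breaks and labels could be applied to the sample itself, e.g. cut(s.odd, breaks = brk, right = FALSE, labels = c("youth", "young", "youngsters")).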

0 answers:

No answers