Binning ages in R

Time: 2014-08-18 23:46:49

Tags: r binning

I am trying to write a function that sorts ages into different groups.

Suppose my data looks like this:

  

birthyear

1987 1995 1994 1981 1994 1989 1985 1987 1996 1981 1980 1994 1996 1983 1949 1988
1998 1977 1967 1968

My function converts the birth years to ages and then sorts each one into 1 of 10 categories, based on a data frame named agebreaks:

>agebreaks
                Category Birth.min Birth.max
1       14 to 19 years      2000      1995
2       20 to 24 years      1994      1990
3       25 to 34 years      1989      1980
4       35 to 44 years      1979      1970
5       45 to 54 years      1969      1960
6       55 to 59 years      1959      1955
7       60 to 64 years      1954      1950
8       65 to 74 years      1949      1940
9       75 to 84 years      1939      1930
10   85 years and over      1959      1864

The function:

    bin.age <- function(birthyear, agebreak, yyyy = 2014){
    # convert birth years to ages relative to the reference year yyyy
    p.ages <- yyyy - birthyear
    ab     <- as.data.frame(agebreak)
    min.ab <- yyyy-ab$Birth.min
    max.ab <- yyyy-ab$Birth.max
    avec   <- sort(c(min.ab[1],max.ab[1],min.ab[2],max.ab[2],min.ab[3],max.ab[3],min.ab[4],max.ab[4],min.ab[5],max.ab[5],min.ab[6],max.ab[6],min.ab[7],max.ab[7],min.ab[8],max.ab[8],min.ab[9],max.ab[9],min.ab[10],max.ab[10]))


    tmp <- findInterval(p.ages, avec)
    tt  <- table(tmp)
    names(tt)<-c("14 to 19 years","20 to 24 years","25 to 34 years","35 to 44 years","45 to 54 years","55 to 59 years","60 to 64 years","65 to 74 years","75 to 84 years","85 years and over")
return(tt)
}

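What I want is for everyone aged 14 to 19 to be grouped together, everyone aged 20 to 24 to be grouped together, and so on. Instead of the desired 10 groups, I get 18 of 20 groups. I have also tried cut(), to no avail. Any suggestions?

1 Answer:

Answer 0: (score: 1)

cut() is probably the right function. The problem is that you only need to specify the break points between the ranges, not the start and the end of each interval, because the variable is treated as continuous.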
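
To illustrate the point about break points, here is a minimal sketch on a toy vector (`x` and the break values are invented for illustration and are not part of the question's data). With cut(), n + 1 break points define n adjacent intervals. Passing both the start and the end of every interval to findInterval() roughly doubles the number of bins.

```r
# Toy data, invented for illustration only
x <- c(1, 4, 5, 6, 9, 10)

# Three break points define two intervals: [0,5] and (5,10]
two_bins <- cut(x, breaks = c(0, 5, 10), include.lowest = TRUE)
nlevels(two_bins)    # 2

# Listing the start and end of each interval separately (0-5 and 6-10)
# yields four break points, so findInterval() returns four distinct
# codes here for what was meant to be two groups
codes <- findInterval(x, c(0, 5, 6, 10))
sort(unique(codes))  # 1 2 3 4
```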

#input data
birthyear <- c(1987, 1995, 1994, 1981, 1994, 1989, 1985, 1987, 1996, 1981, 
    1980, 1994, 1996, 1983, 1949, 1988, 1998, 1977, 1967, 1968)
agebreaks <- c(1864, 1929, 1939,1949,1954,1959,1969,1979,1989,1994,2000)

#cut
a <- cut(birthyear, agebreaks, include.lowest=TRUE)
#rename
levels(a) <- rev(c("14 to 19 years","20 to 24 years","25 to 34 years",
    "35 to 44 years","45 to 54 years","55 to 59 years","60 to 64 years",
    "65 to 74 years","75 to 84 years","85 years and over"))

#table
as.data.frame(table(a))

#result
                   a Freq
1  85 years and over    0
2     75 to 84 years    0
3     65 to 74 years    1
4     60 to 64 years    0
5     55 to 59 years    0
6     45 to 54 years    2
7     35 to 44 years    1
8     25 to 34 years    9
9     20 to 24 years    3
10    14 to 19 years    4
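
As a possible variation on the approach above, the labels can be passed straight to cut() through its `labels` argument, and the binning can be done on ages instead of birth years, which keeps the categories in ascending order. This is only a sketch: the function name `bin_age`, the `year` argument, and the age break points are assumptions chosen to mirror the categories in the question, not something given in the original answer.

```r
# Hypothetical helper: bins birth years into the ten age categories
# from the question. `year` is the reference year (2014 in the question).
bin_age <- function(birthyear, year = 2014) {
  ages <- year - birthyear
  # 11 break points define 10 left-closed intervals: [14,20), [20,25), ...
  age_breaks <- c(14, 20, 25, 35, 45, 55, 60, 65, 75, 85, Inf)
  age_labels <- c("14 to 19 years", "20 to 24 years", "25 to 34 years",
                  "35 to 44 years", "45 to 54 years", "55 to 59 years",
                  "60 to 64 years", "65 to 74 years", "75 to 84 years",
                  "85 years and over")
  cut(ages, breaks = age_breaks, labels = age_labels, right = FALSE)
}

birthyear <- c(1987, 1995, 1994, 1981, 1994, 1989, 1985, 1987, 1996, 1981,
               1980, 1994, 1996, 1983, 1949, 1988, 1998, 1977, 1967, 1968)
counts <- table(bin_age(birthyear))
as.data.frame(counts)
```

With `right = FALSE` each interval includes its lower bound, so an age of exactly 20 lands in "20 to 24 years". Ages below 14 fall outside the break points and come back as NA.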