Changing numeric data to categorical data

Time: 2014-06-11 18:33:28

Tags: r categorization

Here is the link to the dataset. I am trying to categorize my data. It works fine for DR_AGE.

setwd("~/data1")
a2 <- read.csv("data1.csv")
dim(a2)
[1] 11503     7
names(a2)
[1] "CR_HOUR"  "adt"      "ln"       "pav"      "DR_AGE"   "NUM_OCC"  "VEH_YEAR"

## categorize DR_AGE

a2$DR_AGE[a2$DR_AGE < 25] <- "15-24"
a2$DR_AGE[a2$DR_AGE>24 & a2$DR_AGE < 35] <- "25-34"
a2$DR_AGE[a2$DR_AGE >34 & a2$DR_AGE < 45] <- "35-44"
a2$DR_AGE[a2$DR_AGE >44 & a2$DR_AGE < 55] <- "45-54"
a2$DR_AGE[a2$DR_AGE >54 & a2$DR_AGE < 65] <- "55-64"
a2$DR_AGE[a2$DR_AGE >64 & a2$DR_AGE < 75] <- "65-74"
a2$DR_AGE[a2$DR_AGE >74 ] <- "75 plus"
a2$DR_AGE <- factor(a2$DR_AGE)
table(a2[, "DR_AGE"])                 ## All categories are generated. 
  15-24   25-34   35-44   45-54   55-64   65-74 75 plus 
   2298    2118    1638    1526    1036     511     350 

But I run into problems when I try to categorize CR_HOUR or VEH_YEAR.

## categorize CR_HOUR  
a2$CR_HOUR[a2$CR_HOUR < 7] <- "00-06"
a2$CR_HOUR[a2$CR_HOUR>6 & a2$CR_HOUR < 13] <- "07-12"
a2$CR_HOUR[a2$CR_HOUR >12 & a2$CR_HOUR < 19] <- "13-18"
a2$CR_HOUR[a2$CR_HOUR >18 ] <- "19-24"
a2$CR_HOUR <- factor(a2$CR_HOUR)
table(a2[, "CR_HOUR"])              ### "07-12" is not generated. ????

00-06    10    11    12 13-18 19-24 
 1234   303   338   378  4152  5096 

## categorize VEH_YEAR
a2$VEH_YEAR[a2$VEH_YEAR >1930 & a2$VEH_YEAR <1991] <- "1990 and Before"
a2$VEH_YEAR[a2$VEH_YEAR>1990 & a2$VEH_YEAR < 2001] <- "1991-2000"
a2$VEH_YEAR[a2$VEH_YEAR>2000 & a2$VEH_YEAR < 2011] <- "2001-2010"
a2$VEH_YEAR[a2$VEH_YEAR >2010] <- "2011 and After"
a2$VEH_YEAR<- factor(a2$VEH_YEAR)
table(a2[, "VEH_YEAR"])              ### "1990 and Before" is not generated. ????

     1991-2000      2001-2010 2011 and After 
          4842           4763             57 

I am struggling to fix this. Any help is appreciated.

1 Answer:

Answer 0: (score: 1)

The problem is that when you do

a2$CR_HOUR[a2$CR_HOUR < 7] <- "00-06"

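you are assigning a character value to a numeric column. This changes the data type of CR_HOUR to character, and it messes up the comparisons further down, which are then done as string comparisons rather than numeric ones. This is not a good way to recode data. It's better to create a new character vector for the category names and then add it to the data.frame, or to replace the current column only after you have done all the replacements.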
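To make the coercion concrete, here is a minimal sketch on made-up values (the vector `hours` and the name `hour_cat` are illustrative and are not part of the original data set). It shows the type change after the first assignment, two of the string comparisons that go wrong, and one way of filling a separate character vector while the numeric column stays untouched. String ordering can depend on the locale, but digits compare character by character in the usual ones.

```r
# Toy values standing in for CR_HOUR (assumed, not the real data).
hours <- c(3, 7, 10, 15, 20)

recoded <- hours
recoded[recoded < 7] <- "00-06"   # one string turns the whole vector into character
class(recoded)                    # "character"
recoded                           # "00-06" "7" "10" "15" "20"

# Comparisons against numbers are now done on strings, character by character:
"7" < 13      # FALSE: 13 becomes "13", and "7" sorts after "1"
"10" > 6      # FALSE: 6 becomes "6", and "1" sorts before "6"

# Safer: always test the numeric vector and write into a new character vector.
hour_cat <- rep(NA_character_, length(hours))
hour_cat[hours < 7]                <- "00-06"
hour_cat[hours >= 7  & hours < 13] <- "07-12"
hour_cat[hours >= 13 & hours < 19] <- "13-18"
hour_cat[hours >= 19]              <- "19-24"
hour_cat <- factor(hour_cat)
table(hour_cat)
```

This is probably also why DR_AGE appeared to work: if every age has two digits, string order and numeric order happen to agree.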

If you have ranges like this, the cut() command can be very useful. For example

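# Age groups. Intervals are right-closed by default, so (14,24] covers 15-24, etc.
# The generated labels are "15-24", ..., "65-74", and "75-Inf" for the last group.
agebr <- c(14,24,34,44,54,64,74,Inf)
a2$DR_AGE <- cut(a2$DR_AGE, breaks=agebr, 
    labels=paste(head(agebr,-1)+1, tail(agebr,-1), sep="-"))
table(a2$DR_AGE)

# Hour groups. include.lowest=TRUE puts hour 0 into the first interval.
# The generated labels are "00-06", "07-12", "13-18", "19-24".
hourbr <- c(0,6,12,18,24)
a2$CR_HOUR <- cut(a2$CR_HOUR, breaks=hourbr, 
     labels=paste(sprintf("%02d", ifelse(head(hourbr,-1)>0,head(hourbr,-1)+1,0)),
     sprintf("%02d",tail(hourbr,-1)), sep="-"), include.lowest=TRUE)
table(a2$CR_HOUR)

# Vehicle year groups with explicit labels.
a2$VEH_YEAR <- cut(a2$VEH_YEAR, breaks=c(0,1990,2000,2010,Inf), 
    labels=c("1990 and Before","1991-2000","2001-2010","2011 and After"))
table(a2$VEH_YEAR)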

It's a bit messy because I was trying to produce the same labels, but the function itself is easy to use.
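If generating the labels with paste() feels too fiddly, a simpler variant is to type them out. The sketch below uses made-up vectors (`ages` and `hrs` are illustrative names, not columns of the real data) and also shows why include.lowest matters when the smallest value sits exactly on the first break.

```r
# Made-up ages, chosen to sit on and around the interval edges.
ages <- c(15, 24, 25, 44, 64, 80)

age_cat <- cut(ages,
               breaks = c(14, 24, 34, 44, 54, 64, 74, Inf),
               labels = c("15-24", "25-34", "35-44", "45-54",
                          "55-64", "65-74", "75 plus"))
table(age_cat)   # all seven levels are listed, empty ones with a count of 0

# Intervals are (lower, upper] by default, so a value equal to the first
# break falls outside every interval and becomes NA ...
hrs <- c(0, 6, 7, 24)
is.na(cut(hrs, breaks = c(0, 6, 12, 18, 24)))

# ... unless include.lowest = TRUE closes the first interval on the left.
hr_cat <- cut(hrs, breaks = c(0, 6, 12, 18, 24),
              labels = c("00-06", "07-12", "13-18", "19-24"),
              include.lowest = TRUE)
hr_cat
```

Because cut() returns a factor whose levels follow the order of the breaks, the table comes out in the intended order without a separate factor() call.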