Replace values if a factor has more than 52 levels

Time: 2018-02-09 11:14:26

Tags: r if-statement replace categorical-data

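I am running a classification algorithm in R with many categorical variables, and my problem is that many of them have more than 52 levels (53 is the limit for most classification algorithms).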

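So what I want to do is replace values when the number of levels (ranked by frequency) is greater than 52. In other words, I want to keep the 52 most frequent factor levels and replace all the remaining levels with "Other".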

Here is my code:

motor2$BESTUURDER.PERS_POSTCODE <- as.factor(motor2$BESTUURDER.PERS_POSTCODE)
var.smry <- motor2%>%
  select(BESTUURDER.PERS_POSTCODE)%>%
  group_by(BESTUURDER.PERS_POSTCODE)%>%
  dplyr::summarise(n())

names(var.smry)[2] <- "Count"
var.smry <- var.smry%>%
  arrange(desc(Count))
var.smry$Count <- as.factor(var.smry$Count)

var.smry<- setDT(var.smry, keep.rownames = TRUE)[]

var2 <- var.smry%>%
  select(rn, BESTUURDER.PERS_POSTCODE)

motor2 <- (merge(var2, motor2, by = 'BESTUURDER.PERS_POSTCODE'))

motor2$BESTUURDER.PERS_POSTCODE <- as.character(motor2$BESTUURDER.PERS_POSTCODE)

motor2$BESTUURDER.PERS_POSTCODE <- ifelse(motor2$rn >= 52, ifelse(!is.na(motor2$BESTUURDER.PERS_POSTCODE),"Other",motor2$BESTUURDER.PERS_POSTCODE), motor2$BESTUURDER.PERS_POSTCODE)

motor2 <- motor2%>%
  select(-rn)

motor2$BESTUURDER.PERS_POSTCODE <- as.character(motor2$BESTUURDER.PERS_POSTCODE)
motor2$BESTUURDER.PERS_POSTCODE[is.na(motor2$BESTUURDER.PERS_POSTCODE)] <- "missing"
motor2$BESTUURDER.PERS_POSTCODE <- as.factor(motor2$BESTUURDER.PERS_POSTCODE)

I really don't understand why it doesn't work...

Any help is much appreciated.

Many thanks,

阿伦

1 answer:

Answer 0 (score: 0)

Here is a toy example that does what you want (with a reproducible example I could do better):

Create the data.frame:

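# stringsAsFactors = TRUE is needed in R >= 4.0.0, where strings are no longer converted to factors by default
db <- data.frame(var = c("a","a","a","a","a","b","b","b","b","c","c","c","c","d"), stringsAsFactors = TRUE)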
levels(db$var)
[1] "a" "b" "c" "d"

nmax_lev <- 2 # Number of levels to keep, not counting "Other"
Freq <- table(db$var) # Frequency of each level
Most_freq <- names(head(Freq[order(Freq, decreasing = TRUE)], nmax_lev)) # Names of the most frequent levels

Replace:

to_replace <- !(db$var %in% Most_freq) # TRUE for rows outside the most frequent levels; avoids masking base::replace
new_var <- as.character(db$var)
new_var[to_replace] <- "Other"

Your output:

db[, 'var'] <- as.factor(new_var)
levels(db$var)
[1] "a"     "b"     "Other"
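
To apply the same idea to a real column such as BESTUURDER.PERS_POSTCODE, the steps above can be wrapped in a small function. This is only a sketch in base R: the name lump_top_n and its arguments are hypothetical (they are not from any package), and the "missing" label for NA values is an assumption taken from the question's code.

```r
# Hypothetical helper: keep the n most frequent values of x and
# replace every other non-missing value with `other`.
lump_top_n <- function(x, n = 52, other = "Other", na_label = "missing") {
  x <- as.character(x)
  # table() ignores NA by default, so missing values are not ranked
  freq <- table(x)
  freq <- freq[order(freq, decreasing = TRUE)]
  keep <- names(freq)[seq_len(min(n, length(freq)))]
  # Non-missing values outside the top n become `other`
  x[!is.na(x) & !(x %in% keep)] <- other
  # Missing values get their own label (assumption, mirrors the question)
  x[is.na(x)] <- na_label
  factor(x)
}

# Toy data: a x5, b x4, c x3, d x1, plus one missing value
db <- data.frame(
  var = c("a","a","a","a","a","b","b","b","b","c","c","c","d", NA),
  stringsAsFactors = TRUE
)
db$var <- lump_top_n(db$var, n = 2)
table(db$var)

# Apply the helper to every factor column that has more than 52 levels
for (col in names(db)) {
  if (is.factor(db[[col]]) && nlevels(db[[col]]) > 52) {
    db[[col]] <- lump_top_n(db[[col]], n = 52)
  }
}
```

Ranking on the character values and only converting back to a factor at the end avoids comparing row names as strings, and it keeps the NA handling in one place. Note that "Other" and "missing" themselves count as levels, so n may need to be lowered by one or two to stay under a hard limit.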