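I am running a classification algorithm in R on data with many categorical variables, and my problem is that many of them have more than 52 factor levels (53 is the limit for most classification algorithms).
So what I want to do is replace values whenever the number of levels (ranked by frequency) exceeds 52. That is: I want to keep the 52 most frequent factor levels and replace all the other levels with "Other".
Here is my code: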
motor2$BESTUURDER.PERS_POSTCODE <- as.factor(motor2$BESTUURDER.PERS_POSTCODE)
var.smry <- motor2%>%
select(BESTUURDER.PERS_POSTCODE)%>%
group_by(BESTUURDER.PERS_POSTCODE)%>%
dplyr::summarise(n())
names(var.smry)[2] <- "Count"
var.smry <- var.smry%>%
arrange(desc(Count))
var.smry$Count <- as.factor(var.smry$Count)
var.smry<- setDT(var.smry, keep.rownames = TRUE)[]
var2 <- var.smry%>%
select(rn, BESTUURDER.PERS_POSTCODE)
motor2 <- (merge(var2, motor2, by = 'BESTUURDER.PERS_POSTCODE'))
motor2$BESTUURDER.PERS_POSTCODE <- as.character(motor2$BESTUURDER.PERS_POSTCODE)
motor2$BESTUURDER.PERS_POSTCODE <- ifelse(motor2$rn >= 52, ifelse(!is.na(motor2$BESTUURDER.PERS_POSTCODE),"Other",motor2$BESTUURDER.PERS_POSTCODE), motor2$BESTUURDER.PERS_POSTCODE)
motor2 <- motor2%>%
select(-rn)
motor2$BESTUURDER.PERS_POSTCODE <- as.character(motor2$BESTUURDER.PERS_POSTCODE)
motor2$BESTUURDER.PERS_POSTCODE[is.na(motor2$BESTUURDER.PERS_POSTCODE)] <- "missing"
motor2$BESTUURDER.PERS_POSTCODE <- as.factor(motor2$BESTUURDER.PERS_POSTCODE)
I really don't understand why it isn't working...
Any help is much appreciated.
Many thanks,
阿伦
Answer 0 (score: 0)
Here is some toy code that does what you want (without a reproducible example, I can't do any better):
Create the data.frame:
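db <- data.frame(var = c("a","a","a","a","a","b","b","b","b","c","c","c","c","d"), stringsAsFactors = TRUE) # stringsAsFactors = TRUE is needed in R >= 4.0 so that var is a factor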
levels(db$var)
[1] "a" "b" "c" "d"
nmax_lev <- 2 # Number of levels, not counting "Other", that you want to keep
Freq <- table(db$var) # Frequency of each level
Most_freq <- names(head(Freq[order(Freq, decreasing = TRUE)], nmax_lev)) # Names of the nmax_lev most frequent levels
Replace:
to_replace <- !(db$var %in% Most_freq) # TRUE for rows whose level is not among the most frequent
new_var <- as.character(db$var)
new_var[to_replace] <- "Other"
Your output:
db[, 'var'] <- as.factor(new_var)
levels(db$var)
[1] "a" "b" "Other"
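
The steps above could be wrapped into a small reusable function so that the same logic can be applied to every high-cardinality column. This is only a minimal sketch in base R: `lump_levels` and its arguments `n_keep`, `other` and `missing` are hypothetical names introduced here for illustration and do not appear in the question, and the "missing" label for NA values simply mirrors what the question's code attempts.

```r
# Hypothetical helper (not from the original post): keep the n_keep most
# frequent values of x, collapse the rest into `other`, and label NA values
# as `missing`.
lump_levels <- function(x, n_keep = 52, other = "Other", missing = "missing") {
  x <- as.character(x)
  # table() drops NA by default, so missing values do not compete for a slot
  freq <- sort(table(x), decreasing = TRUE)
  keep <- names(freq)[seq_len(min(n_keep, length(freq)))]
  out <- x
  out[!(x %in% keep)] <- other   # everything outside the top n_keep
  out[is.na(x)] <- missing       # NA values get their own label
  factor(out)
}

# Toy data: a x5, b x4, c x3, d x1, plus one NA
db <- data.frame(var = c(rep("a", 5), rep("b", 4), rep("c", 3), "d", NA))
db$var <- lump_levels(db$var, n_keep = 2)
table(db$var)
# counts: a = 5, b = 4, Other = 4, missing = 1 (level order depends on locale)
```

For the column in the question, the call would presumably be `motor2$BESTUURDER.PERS_POSTCODE <- lump_levels(motor2$BESTUURDER.PERS_POSTCODE, n_keep = 52)`. Note that the result can then have up to 54 levels (52 kept, plus "Other", plus "missing"), so if the algorithm's hard limit is 53 and the column contains NA values, a smaller `n_keep` such as 51 may be needed. Ties in frequency are broken by whatever order `sort()` returns, which is an arbitrary choice in this sketch.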