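I have factors in R that are salary ranges of the form "$100,001 - $150,000", "over $150,000", "$25,000", and so on, and I would like to convert these factors to numeric values (for example, converting the factor "$100,001 - $150,000" to the integer 125000).
Similarly, I have education categories such as "High School Diploma", "Current Undergraduate", "PhD", and so on, to which I would like to assign numbers (for example, giving "PhD" a higher value than a lower level such as "High School Diploma").
Given a data frame containing these values, how do I do this?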
Answer 0 (score: 10)
To convert the currency:
# data
df <- data.frame(sal = c("$100,001 - $150,000" , "over $150,000" ,
"$25,000"), educ = c("High School Diploma", "Current Undergraduate",
"PhD"),stringsAsFactors=FALSE)
# Remove comma and dollar sign
temp <- gsub("[,$]","", df$sal)
# remove text
temp <- gsub("[[:alpha:]]","", temp)
# get average over range
df$ave.sal <- sapply(strsplit(temp , "-") , function(i) mean(as.numeric(i)))
For your education levels, if you want them as numbers:
df$educ.f <- as.numeric(factor(df$educ , levels=c("High School Diploma" ,
"Current Undergraduate", "PhD")))
df
#                   sal                  educ  ave.sal educ.f
# 1 $100,001 - $150,000   High School Diploma 125000.5      1
# 2       over $150,000 Current Undergraduate 150000.0      2
# 3             $25,000                   PhD  25000.0      3
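If you need to do this for more than one column, the same steps can be wrapped in a small helper. This is only a sketch: the name salary_to_numeric is made up for illustration, and it assumes every range is written with a single "-" between the two amounts. The second part shows the education levels as an ordered factor, whose integer codes follow the order given in levels.

```r
# Hypothetical helper (not part of the answer above): midpoint of a salary range.
salary_to_numeric <- function(x) {
  cleaned <- gsub("[,$]", "", x)               # drop commas and dollar signs
  cleaned <- gsub("[[:alpha:]]", "", cleaned)  # drop words such as "over"
  parts <- strsplit(cleaned, "-", fixed = TRUE)
  # as.numeric() tolerates the leftover spaces; NA input stays NA
  vapply(parts, function(p) mean(as.numeric(p)), numeric(1))
}

sal <- c("$100,001 - $150,000", "over $150,000", "$25,000", NA)
sal_num <- salary_to_numeric(sal)

# Ordered factor: the integer codes follow the order of `educ_levels`.
educ_levels <- c("High School Diploma", "Current Undergraduate", "PhD")
educ <- factor(c("PhD", "High School Diploma", NA),
               levels = educ_levels, ordered = TRUE)
educ_num <- as.integer(educ)
```

Note that "over $150,000" simply becomes 150000 here, as in the answer; whether that is a sensible value for an open-ended bracket is up to you.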
EDIT
Missing/NA values should not matter:
# Data that includes missing values
df <- data.frame(sal = c("$100,001 - $150,000" , "over $150,000" ,
"$25,000" , NA), educ = c(NA, "High School Diploma",
"Current Undergraduate", "PhD"),stringsAsFactors=FALSE)
Rerun the commands above to get:
df
#                   sal                  educ  ave.sal educ.f
# 1 $100,001 - $150,000                  <NA> 125000.5     NA
# 2       over $150,000   High School Diploma 150000.0      1
# 3             $25,000 Current Undergraduate  25000.0      2
# 4                <NA>                   PhD       NA      3
Answer 1 (score: 8)
You can use the recode function in the car package.
For example:
library(car)
df$salary <- recode(df$salary,
"'$100,001 - $150,000'=125000;'$150,000'=150000")
See the help file for more information on how to use this function.
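If you would rather not depend on car, roughly the same mapping can be sketched in base R with a named vector. The data frame and the names salary_map and salary_num below are made up for illustration. One difference to be aware of: labels that are not in the map become NA here, whereas recode (as far as I know, when no else clause is given) leaves unmatched values unchanged.

```r
# Hypothetical data; the column name `salary` follows the example above.
df <- data.frame(salary = c("$100,001 - $150,000", "$150,000", "$25,000"),
                 stringsAsFactors = FALSE)

# Named vector: names are the original labels, values are the numeric codes.
salary_map <- c("$100,001 - $150,000" = 125000,
                "$150,000"            = 150000)

# Index by label; as.character() also makes this safe for factor columns.
# Labels missing from the map (here "$25,000") come back as NA.
df$salary_num <- unname(salary_map[as.character(df$salary)])
```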
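Answer 2 (score: 0)
I would just make a vector of values that correspond to your factor levels and map them in. The code below is a less elegant solution than I would like, because I can't figure out how to do the indexing with a vector, but if your data is not terribly large it will get the job done. Say we want to map the factor elements of fact to the numbers in vals: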
fact<-as.factor(c("a","b","c"))
vals<-c(1,2,3)
#for example:
vals[levels(fact)=="b"]
# gives: [1] 2
#now make an example data frame:
sample(1:3,10,replace=T)
data<-data.frame(fact[sample(1:3,10,replace=T)])
names(data)<-c("myvar")
#our vlookup function:
vlookup<-function(fact,vals,x) {
#probably should do an error checking to make sure fact
# and vals are the same length
out<-rep(vals[1],length(x))
for (i in seq_along(x)) {
out[i]<-vals[levels(fact)==x[i]]
}
return(out)
}
#test it:
data$myvarNumeric<-vlookup(fact,vals,data$myvar)
This should work for what you have described.
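For what it's worth, the indexing with a vector can be done without the loop by using match(). The sketch below reuses the fact, vals and myvar names from the code above; set.seed is there only to make the example reproducible.

```r
fact <- as.factor(c("a", "b", "c"))
vals <- c(1, 2, 3)

set.seed(1)  # only so the example is reproducible
data <- data.frame(myvar = fact[sample(1:3, 10, replace = TRUE)])

# match() returns each element's position among the levels of fact;
# using those positions to index vals does the whole lookup in one step.
data$myvarNumeric <- vals[match(as.character(data$myvar), levels(fact))]
```

Since data$myvar shares its levels with fact, vals[as.integer(data$myvar)] gives the same result; the match() form also works when x is a plain character vector.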