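I have a dataset of 10 categorical variables that contain NA values. I want to replace the NA values in each column with that column's mode. I drew a histogram of each variable to see the frequency of each category and read off the mode, so I already know which value should replace NA in each column.
I saw a related post, but in my case the replacement values are already known. Here is the link: Replace mean or mode for missing values in R
Here is code to reproduce a dataset: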
> #Create data with missing values
> set.seed(1)
> dat <- data.frame(x=sample(letters[1:3],20,TRUE), y=rnorm(20),
stringsAsFactors=FALSE)
> dat[c(5,10,15),1] <- NA
Here is an example:
> #The head of the first five observations
> head(SmallStoredf, n=5)
Age Gender HouseholdIncome MaritalStatus PresenceofChildren HomeOwnerStatus HomeMarketValue
1 <NA> Male <NA> <NA> <NA> <NA> <NA>
2 45-54 Female <NA> <NA> <NA> <NA> <NA>
5 45-54 Female 75k-100k Married Yes Own 150k-200k
6 25-34 Male 75k-100k Married No Own 300k-350k
7 35-44 Female 125k-150k Married Yes Own 250k-300k
Occupation Education LengthofResidence
1 <NA> <NA> <NA>
2 <NA> <NA> <NA>
5 <NA> Completed High School 9 Years
6 <NA> Completed High School 11-15 years
7 <NA> Completed High School 2 Years
In this example, I would like NA to be replaced with Own in HomeOwnerStatus, with 350K-500K in HomeMarketValue, and with Professional in Occupation.
Edit: I tried to impute the values, but three columns produced warnings and were not filled.
> replacementVals <- c(Age = "45-54", Gender = "Male", HouseholdIncome = "50K-75K",
+ MaritalStatus = "Single", PresenceofChildren = "No",
+ HomeOwnerStatus = "Own", HomeMarketValue = "350K-500K",
+ Occupation = "Professional", Education = "Completed High School",
+ LengthofResidence = "11-15yrs")
> indx1 <- replacementVals[col(df2)][is.na(df2[,names(replacementVals)])]
> df2[is.na(df2[,names(replacementVals)])] <- indx1
#Warning messages:
#1: In `[<-.factor`(`*tmp*`, thisvar, value = c("50K-75K", "50K-75K", :
invalid factor level, NA generated
#2: In `[<-.factor`(`*tmp*`, thisvar, value = c("350K-500K", "350K-500K", :
invalid factor level, NA generated
#3: In `[<-.factor`(`*tmp*`, thisvar, value = c("11-15yrs", "11-15yrs", :
invalid factor level, NA generated
Here is the output:
> head(SmallStoredf)
Age Gender HouseholdIncome MaritalStatus PresenceofChildren HomeOwnerStatus HomeMarketValue
1 45-54 Male <NA> Single No Own <NA>
2 45-54 Female <NA> Single No Own <NA>
5 45-54 Female 75k-100k Married Yes Own 150k-200k
6 25-34 Male 75k-100k Married No Own 300k-350k
7 35-44 Female 125k-150k Married Yes Own 250k-300k
8 55-64 Male 75k-100k Married No Own 150k-200k
Occupation Education LengthofResidence
1 Professional Completed High School <NA>
2 Professional Completed High School <NA>
5 Professional Completed High School 9 Years
6 Professional Completed High School 11-15 years
7 Professional Completed High School 2 Years
8 Professional Completed High School 16-19 years
The NA values were replaced in only some of the columns.
Answer 0 (score: 2)
I slightly modified your reproducible example; here is the setup:
> #Create data with missing values
> set.seed(1)
> dat <- data.frame(x=sample(letters[1:3],20,TRUE), y=rnorm(20),
stringsAsFactors=FALSE)
> dat[c(5,10,15),1] <- NA
> dat[6,2]<-NA
#output
# x y
#1 a 1.511781168450847978590
#2 b 0.389843236411431093291
#3 b -0.621240580541803755210
#4 c -2.214699887177499881830
#5 <NA> 1.124930918143108193874
#6 c NA
#7 c -0.016190263098946087311
#8 b 0.943836210685299215051
#9 b 0.821221195098088552200
#10 <NA> 0.593901321217508826322
#11 a 0.918977371608218240873
#12 a 0.782136300731067102276
#13 c 0.074564983365190601328
#14 b -1.989351695863372793127
#15 <NA> 0.619825747894710232799
#16 b -0.056128739529000784558
#17 c -0.155795506705329295238
#18 c -1.470752383899274429169
#19 b -0.478150055108620353206
#20 c 0.417941560199702411005
Now define your replacement values, named after the columns whose NAs you want to replace:
replacementVals<-c(x="Xreplace", y="Yreplace")
The next call then replaces them all in one go:
dat[is.na(dat[,names(replacementVals)])]<-replacementVals
# x y
#1 a 1.51178116845085
#2 b 0.389843236411431
#3 b -0.621240580541804
#4 c -2.2146998871775
#5 Xreplace 1.12493091814311
#6 c Yreplace
#7 c -0.0161902630989461
#8 b 0.943836210685299
#9 b 0.821221195098089
#10 Yreplace 0.593901321217509
#11 a 0.918977371608218
#12 a 0.782136300731067
#13 c 0.0745649833651906
#14 b -1.98935169586337
#15 Xreplace 0.61982574789471
#16 b -0.0561287395290008
#17 c -0.155795506705329
#18 c -1.47075238389927
#19 b -0.47815005510862
#20 c 0.417941560199702
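Note that in the output above, row 10 of x received Yreplace rather than Xreplace. The sketch below, on a small made-up data frame (`toy` and `vals` are illustrative names, not part of the original example), suggests why: with a logical matrix index, the replacement vector is recycled over the NA positions in column-major order and is not matched to columns by name.

```r
# Toy data (hypothetical): NAs at x[2], x[3], x[4] and y[1]
toy <- data.frame(x = c("a", NA, NA, NA),
                  y = c(NA, "p", "q", "r"),
                  stringsAsFactors = FALSE)
vals <- c(x = "Xrep", y = "Yrep")

# The NAs are visited column by column: x[2], x[3], x[4], y[1].
# The two values are recycled over them as Xrep, Yrep, Xrep, Yrep,
# so the names of `vals` play no role in the assignment.
naive <- toy
naive[is.na(naive[, names(vals)])] <- vals

print(naive)
```

The call also only succeeds when the number of NAs is a multiple of the number of replacement values; otherwise R stops with a "wrong length" error.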
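But as akrun pointed out, and then solved, this does not map well onto your second data frame example. What follows is lifted straight from their comment (so either way, they deserve the check mark for this question).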
Here is the setup; I won't print everything, only the result:
HomeOwnerStatus = c(NA,NA,NA ,"Rent", "Rent" )
HomeMarketValue = c(NA,NA,NA, "350k", "350k")
Occupation = c(NA,NA,NA, NA, NA)
SmallStoredf<-data.frame(HomeOwnerStatus,HomeMarketValue,Occupation, stringsAsFactors=FALSE)
replacementVals<-c("HomeOwnerStatus" = "Own", "HomeMarketValue"="350k", "Occupation"="Professional")
Then, in two steps (which could be combined into one very long line), here you go:
#get the values that we will be replacing
indx1<-replacementVals[col(SmallStoredf)][is.na(SmallStoredf[, names(replacementVals)])]
#do the replacement
SmallStoredf[is.na(SmallStoredf[,names(replacementVals)])] <-indx1
# HomeOwnerStatus HomeMarketValue Occupation
#1 Own 350k Professional
#2 Own 350k Professional
#3 Own 350k Professional
#4 Rent 350k Professional
#5 Rent 350k Professional
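On the `invalid factor level, NA generated` warnings in the question's edit: R emits that warning when a value assigned into a factor column is not one of the factor's existing levels. That is probably what happened with "50K-75K", "350K-500K" and "11-15yrs", since the data shown uses spellings such as "75k-100k" and "11-15 years". The following is a minimal sketch of one way around it, assuming it is acceptable to rebuild the factor afterwards; `replace_na_by_name` is a hypothetical helper, not something from the answers above, and the data is made up.

```r
# Hypothetical helper: fill NAs one column at a time, looking up each
# replacement by column name instead of by position.
replace_na_by_name <- function(df, vals) {
  for (nm in intersect(names(vals), names(df))) {
    column <- df[[nm]]
    was_factor <- is.factor(column)
    # Work on character data so a new value cannot be an "invalid level"
    if (was_factor) column <- as.character(column)
    column[is.na(column)] <- vals[[nm]]
    # Rebuild the factor so the new value becomes a regular level
    df[[nm]] <- if (was_factor) factor(column) else column
  }
  df
}

# Toy data with factor columns (made up for illustration)
toy <- data.frame(
  HomeOwnerStatus = factor(c(NA, "Rent", "Own")),
  HomeMarketValue = factor(c(NA, NA, "150k-200k")),
  Occupation      = factor(c(NA, NA, NA))
)
vals <- c(HomeOwnerStatus = "Own", HomeMarketValue = "350K-500K",
          Occupation = "Professional")

filled <- replace_na_by_name(toy, vals)
print(filled)
```

Rebuilding with `factor()` resets the level order to alphabetical, so an ordered factor would need its levels restored explicitly. Fixing the spelling of the replacement values so that they match the existing levels is the other obvious option.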
Answer 1 (score: 1)
Try this (using your second example, since showing two datasets made it a bit unclear which one you meant):
indx <- which(is.na(SmallStoredf), arr.ind=TRUE)
SmallStoredf[indx] <- c("Own", "350K-500K", "Professional")[indx[,2]]
SmallStoredf
# HomeOwnerStatus HomeMarketValue Occupation
#1 Own 350K-500K Professional
#2 Own 350K-500K Professional
#3 Own 350K-500K Professional
#4 Rent 350k-500k Professional
#5 Rent 500k-1mm Professional
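The code above depends on the replacement vector being written in the same order as the columns, because `indx[,2]` is a column position. As a hedged variation, the sketch below looks the values up by column name instead, so the order of the replacement vector no longer matters. The data frame `toy` and the vector `vals` are made up for illustration.

```r
# Toy data (hypothetical), character columns only
toy <- data.frame(HomeOwnerStatus = c(NA, "Rent"),
                  HomeMarketValue = c(NA_character_, NA_character_),
                  Occupation      = c("Clerk", NA),
                  stringsAsFactors = FALSE)

# Replacement values, deliberately NOT in column order
vals <- c(Occupation = "Professional", HomeOwnerStatus = "Own",
          HomeMarketValue = "350K-500K")

# One row per NA cell, with its row and column position
indx <- which(is.na(toy), arr.ind = TRUE)

# Translate each column position into a column name, then into a value
toy[indx] <- unname(vals[colnames(toy)[indx[, "col"]]])

print(toy)
```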
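Answer 2 (score: 0)
Upgrading from a comment.
If you want to replace missing data with the most common category, bear in mind that several categories in a variable may be tied for most common. So in the code below, the replacement is randomly sampled from the most common categories.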
# some example data with missing
set.seed(1)
dat <- data.frame(x=sample(letters[1:3],20,TRUE),
y=sample(letters[1:3],20,TRUE),
w=rnorm(20),
z=sample(letters[1:3],20,TRUE),
stringsAsFactors=FALSE)
dat[c(5,10,15),1] <- NA
dat[c(3,7),2] <- NA
# function to get replacement for missing
# sample is used to randomly select categories, allowing for the case
# when the maximum frequency is shared by more than one category
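f <- function(x) {
  tab <- table(x)
  l <- sum(is.na(x))
  sample(names(tab)[tab == max(tab)], l, TRUE)
}
# as we are using sample, set.seed before replacing
set.seed(1)
for (i in seq_len(ncol(dat))) {
  # use [[i]] to test and modify the column vector itself;
  # dat[i] is a one-column data frame, for which is.numeric() is always FALSE
  if (!is.numeric(dat[[i]]))
    dat[[i]][is.na(dat[[i]])] <- f(dat[[i]])
}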
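If random tie-breaking is not needed, the modes can also be computed directly instead of being read off histograms. The following is a minimal sketch under that assumption; `mode_value` is a hypothetical helper, the data is made up, and ties go to whichever category `table()` lists first (alphabetical order for character data).

```r
# Hypothetical helper: most frequent non-missing value of a vector.
# table() drops NA by default; which.max() picks the first maximum.
mode_value <- function(x) names(which.max(table(x)))

# Toy data (made up): two categorical columns and one numeric column
toy <- data.frame(x = c("a", "b", "b", NA, "b"),
                  y = c("u", NA, "v", "v", NA),
                  w = c(1.5, 2.0, NA, 0.1, 3.2),
                  stringsAsFactors = FALSE)

# Named vector of modes for the non-numeric columns only
is_cat <- !vapply(toy, is.numeric, logical(1))
modes  <- vapply(toy[is_cat], mode_value, character(1))

# Fill each categorical column with its own mode; numeric columns are left alone
for (nm in names(modes)) {
  toy[[nm]][is.na(toy[[nm]])] <- modes[[nm]]
}

print(modes)
print(toy)
```

The resulting `modes` vector has the same shape as the `replacementVals` vector used in the other answers, so it could be passed to those approaches as well.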
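A gentle warning: think carefully before imputing missing data in this way. For example, income tends to be missing more often in the highest and lowest categories, so with this approach you may estimate the average salary incorrectly. You should consider why each variable is missing, and whether it is reasonable to assume the data are MCAR or MAR. If it is, I would consider a more robust imputation method (the mice package).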