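I have to check the names of all variables in a data.frame: if a name matches, the NA values in that variable must be replaced with the median; otherwise the NAs in the remaining variables are replaced with the mean.
The data.frame cyl_spec has 11 variables, and I have to replace the NAs as follows:
I could of course pick one variable at a time, but I am trying the following code: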
attach(cyl_spec)
var <- colnames(cyl_spec)
for(val in var)
{
if(val == 'viscosity'){viscosity[is.na(viscosity == T)] <- median(viscosity, na.rm = T)}
else if(val == 'wax'){wax[is.na(wax == T)] <- median(wax, na.rm = T)}
else {val[is.na(val == T)] <- mean(val, na.rm = T)}
}
detach(cyl_spec)
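Somehow the code does nothing, and I still get the same number of NAs in the variable with this command:
sum(is.na(cyl_spec$viscosity))
In addition, when I run this code I get the following warning messages: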
Warning messages:
1: In mean.default(val, na.rm = T) :
argument is not numeric or logical: returning NA
2: In mean.default(val, na.rm = T) :
argument is not numeric or logical: returning NA
3: In mean.default(val, na.rm = T) :
argument is not numeric or logical: returning NA
4: In mean.default(val, na.rm = T) :
argument is not numeric or logical: returning NA
5: In mean.default(val, na.rm = T) :
argument is not numeric or logical: returning NA
6: In mean.default(val, na.rm = T) :
argument is not numeric or logical: returning NA
7: In mean.default(val, na.rm = T) :
argument is not numeric or logical: returning NA
8: In mean.default(val, na.rm = T) :
argument is not numeric or logical: returning NA
9: In mean.default(val, na.rm = T) :
argument is not numeric or logical: returning NA
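Can someone help me find a solution? I'm stuck! Thanks in advance!!
Answer 0 (score: 0)
You don't need a loop to do this. Also, the straightforward way to test for NA values is is.na(var), not is.na(var == TRUE). Finally, if you want to avoid typing the name of the data frame, you need a function that does this for you (such as with() or the dplyr functions). In your code, R treats viscosity as a free-standing object rather than as the column inside cyl_spec (the same goes for the other variable names): even after attach(), assigning to it only modifies a copy in the workspace and leaves cyl_spec untouched. In the else branch, val is just a character string holding the column name, which is why mean() warns that its argument is not numeric.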
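# Median for the two named columns
cyl_spec$viscosity[is.na(cyl_spec$viscosity)] <- median(cyl_spec$viscosity, na.rm = TRUE)
cyl_spec$wax[is.na(cyl_spec$wax)] <- median(cyl_spec$wax, na.rm = TRUE)
# Mean for a column named val (substitute the actual name of each remaining column)
cyl_spec$val[is.na(cyl_spec$val)] <- mean(cyl_spec$val, na.rm = TRUE)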
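To make the difference between assigning to a free-standing copy and assigning through the data frame concrete, here is a minimal sketch. The data frame toy and its values are hypothetical stand-ins for cyl_spec, not data from the question.

```r
# Hypothetical toy data standing in for cyl_spec
toy <- data.frame(viscosity = c(1, NA, 3, 10), wax = c(2, 4, NA, 6))

# is.na(x) and is.na(x == TRUE) flag the same positions,
# so the comparison with TRUE is redundant
same_flags <- identical(is.na(toy$viscosity), is.na(toy$viscosity == TRUE))

# Assigning to a free-standing copy leaves the data frame untouched
viscosity <- toy$viscosity
viscosity[is.na(viscosity)] <- median(viscosity, na.rm = TRUE)
na_after_copy <- sum(is.na(toy$viscosity))

# Assigning through the data frame changes the column itself
toy$viscosity[is.na(toy$viscosity)] <- median(toy$viscosity, na.rm = TRUE)
na_after_fix <- sum(is.na(toy$viscosity))
```

Comparing na_after_copy with na_after_fix shows which of the two assignments actually reached the data frame.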
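If you only need to handle this one data.frame and only these three variables, I strongly recommend sticking with this base R solution. However, if you want to do this on data frames with more variables and want to automate it, you can look at dplyr::mutate_each. Below is an example on simulated data.
We create a data.frame with 7 variables and assign some NA values.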
library(dplyr)
set.seed(10)
df <- data.frame(n=runif(100),
m=runif(100),
d=runif(100),
o=runif(100),
e=runif(100),
f=runif(100),
g=runif(100))
df <- mutate_each(df,funs(ifelse(.>.8,NA,.)))
head(df)
           n          m         d           o         e          f         g
1 0.50747820 0.34434350 0.2230884 0.347860110        NA         NA        NA
2 0.30676851 0.06132255 0.5358950 0.007992606 0.6855115         NA 0.7478783
3 0.42690767 0.36897981 0.6625291 0.401344915 0.6296311         NA 0.7225419
4 0.69310208 0.40759356        NA 0.588350693 0.7508252 0.29063776 0.5457709
5 0.08513597         NA 0.1491831          NA        NA 0.07203601 0.2641231
6 0.22543662         NA 0.6700994 0.708542599 0.3600703 0.55888842 0.3057243
Now we apply a function to each variable to impute the NA values with the mean or the median:
df <- df %>%
## Which variables are to be recoded with mean? here, n and m
mutate_each(funs(ifelse(is.na(.),mean(.,na.rm = TRUE),.)),n,m) %>%
## Which variables are to be recoded with median? here, d,o,e,f,g
mutate_each(funs(ifelse(is.na(.),median(.,na.rm = TRUE),.)),d,o,e,f,g)
head(df)
           n          m         d           o         e          f         g
1 0.50747820 0.34434350 0.2230884 0.347860110 0.3602354 0.39956699 0.4499041
2 0.30676851 0.06132255 0.5358950 0.007992606 0.6855115 0.39956699 0.7478783
3 0.42690767 0.36897981 0.6625291 0.401344915 0.6296311 0.39956699 0.7225419
4 0.69310208 0.40759356 0.4407363 0.588350693 0.7508252 0.29063776 0.5457709
5 0.08513597 0.40892568 0.1491831 0.378731867 0.3602354 0.07203601 0.2641231
6 0.22543662 0.40892568 0.6700994 0.708542599 0.3600703 0.55888842 0.3057243
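mutate_each() and funs() have since been deprecated in dplyr. Assuming dplyr 1.0.0 or later, the same idea can be sketched with across(). The data frame toy and the column density below are hypothetical stand-ins for cyl_spec and its other columns.

```r
library(dplyr)

# Hypothetical toy data standing in for cyl_spec
toy <- data.frame(
  viscosity = c(1, NA, 3, 10),
  wax       = c(2, 4, NA, 6),
  density   = c(NA, 1, 2, 3)
)

filled <- toy %>%
  # median for the two named columns
  mutate(across(c(viscosity, wax),
                ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x))) %>%
  # mean for every other column
  mutate(across(-c(viscosity, wax),
                ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))
```

This sketch assumes every remaining column is numeric; with mixed types, the second selection would need to be narrowed, for example to where(is.numeric) minus the two median columns.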
Answer 1 (score: 0)
Although @scoa has already answered the question, if you still want to do this with a for loop, just remove the attach() and detach() calls and do the following:
var <- names(cyl_spec)          # all column names
cols <- c('viscosity', 'wax')   # columns to fill with the median
for (val in var)
{
  # rows with NA values in the current column
  na_rows <- is.na(cyl_spec[, val])
  if (val %in% cols)
  {
    # listed column: use the median
    cyl_spec[na_rows, val] <- median(cyl_spec[, val], na.rm = TRUE)
  }
  else
  {
    # every other column: use the mean
    cyl_spec[na_rows, val] <- mean(cyl_spec[, val], na.rm = TRUE)
  }
}
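...although, as you may have noticed, this is quite tedious. I strongly recommend typing the assignments out directly, as in @scoa's answer, and, when you need to change (way) more than 2 columns, considering the mutate function in the dplyr package.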
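For completeness, the same rule can also be sketched in base R without an explicit loop. The helper fill_na, the vector median_cols, and the toy data are hypothetical names introduced only for this illustration. Non-numeric columns are skipped, which avoids the mean() warning from the question.

```r
# Hypothetical toy data standing in for cyl_spec
toy <- data.frame(
  viscosity = c(1, NA, 3, 10),
  wax       = c(2, 4, NA, 6),
  density   = c(NA, 1, 2, 3),
  grade     = c("a", NA, "b", "c")
)

median_cols <- c("viscosity", "wax")   # columns that get the median

# Replace the NAs in x with f(x), where f is mean or median
fill_na <- function(x, f) {
  x[is.na(x)] <- f(x, na.rm = TRUE)
  x
}

# toy[] <- ... keeps the data.frame structure while replacing every column
toy[] <- Map(function(col, name) {
  if (!is.numeric(col)) return(col)   # leave non-numeric columns alone
  fill_na(col, if (name %in% median_cols) median else mean)
}, toy, names(toy))
```

Afterwards, colSums(is.na(toy)) gives a quick per-column count of any NAs that remain.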