Question

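I am trying to recode NAs in a very large data frame with a large number of columns. I have stored the column names in a character vector (num_var) and the replacement values for the different columns in a named vector (median.to.replace). In each column, NA should be replaced with the correct value from median.to.replace.

Running the code inside the seq_along loop by hand, specifying each column name manually, works fine.

However, when I try this simple code, not all of the NAs are recoded, and some NAs are replaced with incorrect values.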

for (name_col in seq_along(num_var))
{
  na_rows <- is.na(allProspect.tst[,name_col]) 
  allProspect.tst[na_rows,name_col] <- median.to.replace[name_col]

}

Does anyone have a pointer to what is going wrong? I am looking for a fast and memory-efficient way to do this.

Answer 1

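This will run much faster if you use a data.table instead of a data.frame. Here I create a random data set with missing values from the mtcars data set, then use a lookup table to replace those missing values.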

library(data.table)
set.seed(44)
f_dowle<-function(DT,value=-1,col) { #copied and edited this function from elsewhere
  set(DT,which(is.na(DT[[col]])),col,value)
}

data(mtcars)

setDT(mtcars)

for(i in colnames(mtcars)){
  rand_na<-sample(1:nrow(mtcars),3)
  mtcars[rand_na,(i):=NA] # (i) tells data.table that i holds the column name

}
head(mtcars) #showing random missing values

    mpg cyl disp  hp drat    wt  qsec vs am gear carb
1: 21.0  NA  160  NA 3.90 2.620 16.46  0  1    4    4
2: 21.0   6   NA 110 3.90    NA 17.02  0  1    4    4
3: 22.8   4  108  NA 3.85 2.320 18.61  1  1    4    1
4: 21.4   6   NA 110 3.08 3.215 19.44  1  0    3    1
5: 18.7  NA  360 175   NA 3.440 17.02  0  0    3    2
6: 18.1   6  225 105 2.76    NA 20.22  1  0    3    1

lkp_dt<-data.table(column=colnames(mtcars),value=1:11)
for(i in colnames(mtcars)){
  value=lkp_dt[column==i,value]
  f_dowle(mtcars,value=value,col=i)

}

head(mtcars) #missing values replaced

    mpg cyl disp  hp drat    wt  qsec vs am gear carb
1: 21.0   2  160   4 3.90 2.620 16.46  0  1    4    4
2: 21.0   6    3 110 3.90 6.000 17.02  0  1    4    4
3: 22.8   4  108   4 3.85 2.320 18.61  1  1    4    1
4: 21.4   6    3 110 3.08 3.215 19.44  1  0    3    1
5: 18.7   2  360 175 5.00 3.440 17.02  0  0    3    2
6: 18.1   6  225 105 2.76 6.000 20.22  1  0    3    1
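
The lookup table above holds placeholder values (1 to 11). As a minimal sketch closer to the question, the same set() approach can take its replacement values from a named vector of medians. The toy table DT and its columns a, b and id are made up for illustration; only set() from data.table and base R's median() are assumed.

```r
library(data.table)

# Hypothetical toy data; the column names and values are invented for this sketch.
DT <- data.table(a = c(1, NA, 3, 4), b = c(NA, 10, 20, 60), id = 1:4)
num_var <- c("a", "b")

# Named numeric vector of per-column medians, computed while ignoring NAs.
median.to.replace <- sapply(num_var, function(col) median(DT[[col]], na.rm = TRUE))

for (col in num_var) {
  # set() updates DT by reference: rows where `col` is NA receive that column's median.
  set(DT, i = which(is.na(DT[[col]])), j = col, value = median.to.replace[[col]])
}
print(DT)
```

Looking the value up by name with median.to.replace[[col]] keeps each column paired with its own median regardless of the order in which the vector was built.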

Answer 2

Based on your comment that the columns in num_var do not start at the first column of the data frame and are not contiguous, you need this:

# simple example with just four columns
allProspect.tst <- data.frame(one=c(1:3,8), two=c(NA,4:6), three=1:4, four= c(5,NA,7, 8))
# want to replace NAs in columns "two" and "four" with values 5 and 7, respectively
num_var <- c("two","four")
median.to.replace <- c(5, 7)
# let's see the data before replacement
print(allProspect.tst)
##  one two three four
##1   1  NA     1    5
##2   2   4     2   NA
##3   3   5     3    7
##4   8   6     4    8

# just loop over the collection of column names (not indices)
for (name_col in num_var) {
  na_rows <- is.na(allProspect.tst[,name_col])
  # key is to get the corresponding element in median.to.replace 
  # using which() index in num_var has value equal name_col
  allProspect.tst[na_rows,name_col] <- median.to.replace[which(num_var==name_col)]
}
# now let's see the replaced data
print(allProspect.tst)
##  one two three four
##1   1   5     1    5
##2   2   4     2    7
##3   3   5     3    7
##4   8   6     4    8
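
The question's loop iterated over seq_along(num_var), which yields the positions 1, 2, ... rather than the column names, so it addressed the first columns of the data frame instead of the ones listed in num_var. The question also describes median.to.replace as a named vector. Assuming its names match the column names (an assumption, since the vector in the example above is unnamed), the lookup can be done directly by name, as in this sketch:

```r
# Toy data mirroring the example above.
allProspect.tst <- data.frame(one = c(1:3, 8), two = c(NA, 4:6),
                              three = 1:4, four = c(5, NA, 7, 8))
num_var <- c("two", "four")
# Assumption: the replacement vector is named after the columns it belongs to.
median.to.replace <- c(two = 5, four = 7)

print(seq_along(num_var))  # positions 1 and 2, which are columns "one" and "two" here
print(num_var)             # the column names that are actually wanted

for (name_col in num_var) {
  na_rows <- is.na(allProspect.tst[[name_col]])
  # Look the replacement up by name, so the order of median.to.replace does not matter.
  allProspect.tst[na_rows, name_col] <- median.to.replace[[name_col]]
}
print(allProspect.tst)
```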

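Update: improving efficiency

There are a number of ways to make the replacement more efficient for a large number of columns, but the most basic is to use mapply, from the *apply family of functions in R's base package (the answer links to an excellent overview of that family). The updated code is as follows:

replace.with.median <- function(col, median.val, df) {
  na_rows <- is.na(df[, col])
  df[na_rows, col] <- median.val
  return(df[, col])
}
allProspect.tst[, num_var] <- mapply(replace.with.median, num_var, median.to.replace, MoreArgs=list(df=allProspect.tst))
print(allProspect.tst)
##  one two three four
##1   1   5     1    5
##2   2   4     2    7
##3   3   5     3    7
##4   8   6     4    8

Notes:

The body of the original for loop is encapsulated in the function replace.with.median. The input arguments are:
- col: the column in which to look for NAs to replace
- median.val: the corresponding value from median.to.replace
- df: the data frame containing the data
This function returns the col column of df, with NA replaced by median.val.

mapply is used as described in the overview linked above, which says it is for

when you have several data structures (e.g. vectors, lists) and you want to apply a function to the first elements of each, then the second elements of each, and so on.

Here, we want to apply the function replace.with.median to the two vectors num_var and median.to.replace in "lock-step" with each other. In addition, we supply the data frame allProspect.tst to replace.with.median through the MoreArgs argument of mapply.

What mapply returns is a collection of column vectors in which the NAs have been replaced. We then replace the corresponding columns of allProspect.tst with these.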
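
One detail worth knowing: by default mapply simplifies equal-length results into a matrix, so all returned columns end up with a common type. A possible variant, sketched here with a hypothetical helper fill.na that is not part of the original answer, uses Map() to return a list of columns instead, and hands the helper only the column vector rather than the whole data frame.

```r
# Toy data mirroring the example above.
allProspect.tst <- data.frame(one = c(1:3, 8), two = c(NA, 4:6),
                              three = 1:4, four = c(5, NA, 7, 8))
num_var <- c("two", "four")
median.to.replace <- c(5, 7)

# Hypothetical helper: fills the NAs of a single column vector with one value.
fill.na <- function(x, value) {
  x[is.na(x)] <- value
  x
}

# Map() walks the selected columns and the replacement values in lock-step and
# returns a list with one filled vector per column; assigning that list to
# allProspect.tst[num_var] replaces the columns one by one.
allProspect.tst[num_var] <- Map(fill.na, allProspect.tst[num_var], median.to.replace)
print(allProspect.tst)
```

This relies on median.to.replace being in the same order as num_var, just as the mapply call does.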

Hope this helps.
