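I am taking an R course that covers the basics of machine learning. We are using the Vanderbilt Titanic dataset HERE. The goal is to use the R mice package to impute missing age values. I have already split my data into training and testing samples, and this is the str(training) output: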
'data.frame': 917 obs. of 14 variables:
$ pclass : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ survived : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 1 2 2 2 ...
$ name : chr "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mrs. Hudson J C (Bessie Waldo Daniels)" ...
$ sex : Factor w/ 2 levels "female","male": 1 2 1 1 2 1 2 1 1 1 ...
$ age : num 29 0.92 2 25 48 63 71 18 24 26 ...
$ sibsp : int 0 1 1 1 0 1 0 1 0 0 ...
$ parch : int 0 2 2 2 0 0 0 0 0 0 ...
$ ticket : chr "24160" "113781" "113781" "113781" ...
$ fare : num 211.3 151.6 151.6 151.6 26.6 ...
$ cabin : chr "B5" "C22 C26" "C22 C26" "C22 C26" ...
$ embarked : Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 2 2 2 4 ...
$ boat : chr "2" "11" "" "" ...
$ body : int NA NA NA NA NA NA 22 NA NA NA ...
$ home.dest: chr "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ...
The instructor then goes on to write:
factor_vars <- c('pclass', 'sex', 'embarked', 'survived')
training[factor_vars] <- lapply(training[factor_vars], function(x) as.factor(x))
impute_variables <- c('pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked')
mice_model <- mice(training[,impute_variables], method='rf')
mice_output <- complete(mice_model)
mice_output
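I understand the factor_vars piece: those variables are marked as factors in the str output. What I do not understand is how impute_variables were chosen, or what exactly they mean. Were they chosen arbitrarily, perhaps because the instructor thought that something like 'pclass' (that is, steerage, coach, or first class) might help predict age (older people might be able to afford first class), while something like 'cabin' has no correlation with it?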
Additionally, in the line mice_model <- mice(training[,impute_variables], method='rf'), which part of the function call states that we want to impute the passengers' age?
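
One way to probe this is to count the missing values per column and then ask mice for its setup without running any imputation iterations. The sketch below does that on a small made-up data frame: `toy` and its values are hypothetical stand-ins for training[, impute_variables], not the Titanic data, and the call uses the default imputation method rather than 'rf' to keep the example light. As far as the mice documentation describes, a column with no missing values gets an empty method string and serves only as a predictor.

```r
library(mice)

# Hypothetical toy data standing in for training[, impute_variables];
# the values are made up for illustration only.
toy <- data.frame(
  sex   = factor(c("female", "male", "female", "male",
                   "female", "male", "male", "female")),
  age   = c(29, NA, 2, 25, NA, 63, 71, 18),
  sibsp = c(0L, 1L, 1L, 1L, 0L, 1L, 0L, 1L),
  fare  = c(211.3, 151.6, 151.6, 26.6, 78.0, 51.5, 49.5, 227.5)
)

# Count the missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# maxit = 0 builds the imputation setup without running any iterations,
# so the chosen methods and predictors can be inspected cheaply.
ini <- mice(toy, maxit = 0, printFlag = FALSE)

# Imputation method per column; an empty string means "not imputed".
print(ini$method)

# Rows are columns to be imputed; a 1 marks a column used as a predictor.
print(ini$predictorMatrix)
```

Running the same two inspections on training[, impute_variables] should show which of the seven columns actually contain NA values and would therefore be imputed.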