How do I find missing values?

Time: 2017-04-04 06:50:54

Tags: r knn imputation

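What techniques (for example KNN or maximum likelihood) can I use to fill in missing values? I want to use R and am trying to find a suitable technique for imputing the missing values.

The sample data looks like this: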

F1    F2       F3       F4       F5   Class
Good  20       5        7        Old  Normal
Good  Missing  8        8        Old  Normal
Good  15       10       10       Old  Normal
Good  50       10       10       Old  Normal
Good  70       10       10       Old  Abnormal
Bad   20       5        7        Old  Abnormal
Good  20       5        80       Old  Abnormal
Good  85       100      100      Old  Abnormal
Good  20       100      Missing  Old  Abnormal
Good  24       6        8.4      Old  Normal
Good  12       9.6      9.6      Old  Normal
Good  18       12       12       Old  Normal
Good  60       12       12       Old  Normal
Good  84       Missing  12       Old  Abnormal
Bad   24       6        8.4      Old  Abnormal
Good  24       6        96       Old  Abnormal
Good  102      120      120      Old  Abnormal
Good  24       120      72       Old  Abnormal

1 answer:

Answer 0 (score: 1)

Here is some code that should help with your analysis.

Check whether the data contains any NA

any(is.na(..name of data..))
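To make this concrete, here is a minimal sketch that loads a few rows of the question's sample data. It assumes the word "Missing" is the marker for absent values, and `dat` is just a hypothetical name for the data frame. Passing `na.strings` lets `read.table` turn those cells into real NA values, which is what `is.na` looks for.

```r
# A few rows of the sample table from the question; "Missing" marks absent values
txt <- "F1 F2 F3 F4 F5 Class
Good 20 5 7 Old Normal
Good Missing 8 8 Old Normal
Good 15 10 10 Old Normal
Good 20 100 Missing Old Abnormal
Good 84 Missing 12 Old Abnormal"

# na.strings converts "Missing" to NA, so F2, F3 and F4 are parsed as numeric
dat <- read.table(text = txt, header = TRUE, na.strings = "Missing",
                  stringsAsFactors = TRUE)

any(is.na(dat))                     # is there any NA at all?
colSums(is.na(dat))                 # number of NAs in each column
which(is.na(dat), arr.ind = TRUE)   # row/column position of each NA
```

Without `na.strings`, the "Missing" cells would be read as text and the numeric columns would silently become factors or character vectors.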

Visualize the missing data

require(VIM)
aggr(..name of data..,plot = TRUE,bars=TRUE)

Compute the proportion of NAs

Create a simple function
propmiss <- function(dataframe) lapply(dataframe,function(x) data.frame(nmiss=sum(is.na(x)), n=length(x), propmiss=sum(is.na(x))/length(x)))

propmiss(..name of data..)
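A more compact variant is sketched below. It relies on the fact that the mean of a logical vector is the share of TRUE values, so `colMeans(is.na(...))` gives the missing proportion per column directly. The data frame `df` is made up for illustration.

```r
# Hypothetical toy data with NAs in two columns
df <- data.frame(a = c(1, NA, 3, 4),
                 b = c(NA, NA, "x", "y"),
                 c = 1:4)

# One row per column of df: NA count, number of rows, and NA proportion
miss <- data.frame(nmiss    = colSums(is.na(df)),
                   n        = nrow(df),
                   propmiss = colMeans(is.na(df)))
miss
```

Multiply `propmiss` by 100 if you want a percentage rather than a proportion.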

Remove rows in which more than 50% of the values are missing (the same idea works for columns)

# Indices of rows where more than half of the columns are NA
sparse.rows <- which(rowSums(is.na(clust.datatrain)) > 0.5 * ncol(clust.datatrain))
length(sparse.rows)  # 25 for the answerer's data
# Guard the removal: indexing with an empty negative vector would drop every row
if (length(sparse.rows) > 0) {
  clust.datatrain <- clust.datatrain[-sparse.rows, ]
}
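The column version can be sketched the same way. The example below uses a made-up data frame `df` and a logical mask rather than negative indices, so it also behaves correctly when no column exceeds the threshold.

```r
# Hypothetical toy data: column "b" is 75% NA
df <- data.frame(a = c(1, 2, NA, 4),
                 b = c(NA, NA, NA, 1),
                 c = 1:4)

# TRUE for columns where more than half of the values are NA
sparse.cols <- colMeans(is.na(df)) > 0.5

# Keep the remaining columns; drop = FALSE keeps a data frame even if one column is left
df.kept <- df[, !sparse.cols, drop = FALSE]
names(df.kept)
```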

Imputation

KNN

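# Note: DMwR has been archived on CRAN; its successor DMwR2 also provides knnImputation()
library(DMwR)
# Fill each NA with the distance-weighted average of its 10 nearest neighbours
train.1 <- knnImputation(clust.datatrain, k = 10, scale = TRUE, meth = "weighAvg",
                         distData = NULL)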
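To show roughly what a weighted-average kNN imputation does, here is a simplified base R sketch. It is not the DMwR implementation: the helper name `knn_impute_sketch` is hypothetical, it handles numeric columns only, it uses only complete rows as donors, and the `exp(-distance)` weighting is an assumption chosen for illustration.

```r
# Simplified kNN imputation for an all-numeric data frame (illustrative only)
knn_impute_sketch <- function(df, k = 3) {
  x <- scale(as.matrix(df))                 # standardise so no column dominates the distance
  donors <- which(complete.cases(df))       # rows without any NA act as donors
  out <- df
  for (i in which(!complete.cases(df))) {
    obs <- !is.na(x[i, ])                   # columns that are observed in row i
    # Euclidean distance from row i to every donor, using only the observed columns
    diffs <- sweep(x[donors, obs, drop = FALSE], 2, x[i, obs])
    d <- sqrt(rowSums(diffs^2))
    nn <- order(d)[seq_len(min(k, length(d)))]
    w <- exp(-d[nn])                        # closer donors get larger weights
    for (j in which(!obs)) {
      out[i, j] <- sum(w * df[donors[nn], j]) / sum(w)
    }
  }
  out
}

# Hypothetical toy data: row 3 is missing y and sits closest to rows 1 and 2
toy <- data.frame(x = c(1, 2, 3, 10, 11),
                  y = c(1.1, 2.1, NA, 10.2, 11.3))
filled <- knn_impute_sketch(toy, k = 2)
filled
```

With `k = 2` the missing `y` in row 3 is filled from rows 1 and 2 only, so the result lies between their `y` values and leans towards the nearer row.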

MICE (it offers several imputation methods); the example below uses Bayesian linear regression

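library(mice)
# method = "norm" is Bayesian linear regression; m = 5 creates five imputed data sets
xdash <- mice(datafile, m = 5, maxit = 50, method = "norm", seed = 500)
# Extract the first of the five completed data sets
completedata <- complete(xdash, 1)
completedata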
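Extracting a single completed data set ignores the uncertainty between the imputations. If the goal is a model rather than a filled-in table, the usual mice workflow is to fit the model on every imputed data set and pool the results. The sketch below assumes the `nhanes` example data that ships with mice and a made-up regression of `bmi` on `age` and `chl`; substitute your own data and formula.

```r
library(mice)

# nhanes is a small example data set shipped with mice (age, bmi, hyp, chl)
imp <- mice(nhanes, m = 5, maxit = 10, method = "norm", seed = 500,
            printFlag = FALSE)

# Fit the same linear model on each of the 5 completed data sets
fit <- with(imp, lm(bmi ~ age + chl))

# Combine the 5 sets of estimates into one
pooled <- pool(fit)
summary(pooled)
```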

All of this should help with both the analysis and the imputation!