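I have some code that identifies outliers in a data frame and then either removes or caps them. I am trying to use the apply() function (or possibly another method) to speed up the removal step.
Example data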
https://github.com/crossfitAL/so_ex_data/blob/master/subset
# The link above points to a CSV file; load it into your R session as the data frame fico2.
# set up an example decision-matrix
# rm.mat is a {length(cols) x 4} matrix -- in this example 8 x 4
# rm.mat[,1:2] - identify the values for min/max outliers, respectively.
# rm.mat[,3:4] - identify if you wish to remove min/max outliers, respectively.
cols <- c(1, 6:12) # specify the columns you wish to examine
rm.mat <- matrix(nrow = length(cols), ncol = 4,
                 dimnames = list(names(fico2[cols]),
                                 c("out.min", "out.max", "rm outliers?", "rm outliers?")))
# add example decision criteria
rm.mat[, 1] <- apply(fico2[, cols], 2, quantile, probs= .05)
rm.mat[, 2] <- apply(fico2[, cols], 2, quantile, probs= .95)
rm.mat[, 3] <- replicate(4, c(0,1))
rm.mat[, 4] <- replicate(4, c(1,0))
Here is my current subsetting code:
df2 <- fico2 # create a copy of the data frame
cnt <- 1 # add a count variable
for (i in cols) {
  # For each column of interest in the data frame, determine whether there are
  # min/max outliers that you wish to remove, and remove them.
  if (rm.mat[cnt, 3] == 1 && rm.mat[cnt, 4] == 1) {
    # subset / remove min and max outliers
    df2 <- df2[df2[, i] >= rm.mat[cnt, 1] & df2[, i] <= rm.mat[cnt, 2], ]
  } else if (rm.mat[cnt, 3] == 1 && rm.mat[cnt, 4] == 0) {
    # subset / remove min outliers
    df2 <- df2[df2[, i] >= rm.mat[cnt, 1], ]
  } else if (rm.mat[cnt, 3] == 0 && rm.mat[cnt, 4] == 1) {
    # subset / remove max outliers
    df2 <- df2[df2[, i] <= rm.mat[cnt, 2], ]
  }
  cnt <- cnt + 1
}
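Proposed solution:
I think I should be able to do this with an apply-type function; removing the for loop and vectorizing should speed up the code. The problem I am running into is that I want to apply the function if and only if the decision matrix says I should, i.e. use the logical vectors rm.mat[,3] or rm.mat[,4] to determine whether the subset operator "[" should be applied to the data frame df2.
Any help would be greatly appreciated! Also, please let me know whether the example data/code is sufficient.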
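Answer 0 (score: 0)
Here is a solution, offered mainly to clarify your code. Hopefully others can build on it to provide a better one.
So, if I understand correctly, you have a decision matrix like this: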
rm.mat
                                      c1 c2 c3 c4
amount.funded.by.investors     27925.000 NA  0  1
monthly.income                 11666.670 NA  1  0
open.credit.lines                 18.000 NA  0  1
revolving.credit.balance       40788.750 NA  1  0
inquiries.in.the.last.6.months     3.000 NA  0  1
debt.to.inc                       28.299 NA  1  0
int.rate                          20.490 NA  0  1
fico.num                         775.000 NA  1  0
You are trying to filter the large matrix based on the values in this matrix:
colnames(rm.mat) <- paste('c',1:4,sep='')
rm.mat <- as.data.frame(rm.mat)
apply(rm.mat,1,function(y){
h <- paste(y['c3'],y['c4'],sep='')
switch(h,
'11'= apply(df2,2, function(x)
df2[x >= y['c1'] & x <= y['c2'],]), ## we never have this!!
'10'= apply(df2,2, function(x)
df2[x >= y['c1'] , ]), ## here we apply by columns!
'01'= apply(df2,2,function(x)
df2[x <= y['c2'], ])) ## c2 is NA!! so !!!
}
)
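For comparison, here is a minimal sketch of one loop-free way to apply the decision matrix. It is not the code from the question or the answer above. It assumes the examined columns are numeric, that neither they nor the active thresholds contain NA, and that the thresholds are computed once up front (as they are in rm.mat), so filtering column by column gives the same rows as a single combined mask. The data frame `toy`, the matrix `dm`, and the helpers `bounds`, `keep_rows` and `cap_cols` are hypothetical names invented for this illustration. The last helper shows the capping variant mentioned in the question.

```r
# Synthetic stand-in for fico2; these numbers are invented for illustration.
set.seed(1)
toy  <- data.frame(a = rnorm(100), b = runif(100), c = rexp(100))
cols <- c(1, 3)  # columns to examine

# Decision matrix laid out like rm.mat: thresholds in columns 1:2,
# 0/1 removal flags in columns 3:4.
dm <- cbind(
  out.min = apply(toy[, cols], 2, quantile, probs = 0.05),
  out.max = apply(toy[, cols], 2, quantile, probs = 0.95),
  rm.min  = c(1, 0),
  rm.max  = c(1, 1)
)

# Hypothetical helper: per-column bounds, where a switched-off check
# becomes an infinite bound that no finite value can violate.
bounds <- function(dm) {
  list(lo = ifelse(dm[, 3] == 1, dm[, 1], -Inf),
       hi = ifelse(dm[, 4] == 1, dm[, 2],  Inf))
}

# Logical vector with one entry per row: TRUE when the row passes every
# active check in every examined column.
keep_rows <- function(df, cols, dm) {
  b  <- bounds(dm)
  x  <- as.matrix(df[, cols, drop = FALSE])
  # sweep() compares each column against that column's own bound.
  ok <- sweep(x, 2, b$lo, ">=") & sweep(x, 2, b$hi, "<=")
  rowSums(!ok) == 0
}

# Capping variant: clamp values to the active bounds instead of dropping rows.
cap_cols <- function(df, cols, dm) {
  b <- bounds(dm)
  df[cols] <- Map(function(v, l, h) pmin(pmax(v, l), h), df[cols], b$lo, b$hi)
  df
}

filtered <- toy[keep_rows(toy, cols, dm), ]  # rows with outliers removed
capped   <- cap_cols(toy, cols, dm)          # same rows, outliers clamped
```

With the question's objects, the analogous calls would be `fico2[keep_rows(fico2, cols, rm.mat), ]` and `cap_cols(fico2, cols, rm.mat)`, provided those assumptions hold for that data. Whether this is actually faster than the loop on a given data set is something to measure rather than assume.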