I am writing a generic function in R that takes a data frame and a column name and returns a clean data frame with the outliers removed.
cooks_dist <- function(dataframe,column){
dataframe <- dataframe %>% select_if(dataframe,is.numeric)
mod <- lm(column ~ ., data=dataframe)
cooksd <- cooks.distance(mod)
influential <- as.numeric(names(cooksd)[(cooksd > 4*mean(cooksd,na.rm=T))]) # influential row numbers
final <- dataframe[-influential,]
return(final)
}
However, when I run this function, it fails with Error: Can't convert a list to function
The data can be found at http://ucanalytics.com/blogs/wp-content/uploads/2016/09/Regression-Clean-Data.csv
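Answer 0: (score: 2)
The error comes from dplyr::select_if(). The pipe already passes dataframe as the first argument, so dataframe %>% select_if(dataframe, is.numeric) is evaluated as select_if(dataframe, dataframe, is.numeric), and the data frame (a list) ends up where the predicate function is expected. Writing dataframe %>% select_if(is.numeric) avoids that. I believe you want the subset of all numeric columns, so you can also build it with sapply(). Note: because your lm() line produces an error as well, I have inserted a minimal (intercept-only) model in its place.
So I think you want this: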
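cooks_dist <- function(dataframe, column){
  # keep only the numeric columns
  dataframe <- dataframe[, sapply(dataframe, is.numeric)]
  # minimal (intercept-only) model for the chosen column
  mod <- lm(dataframe[, column] ~ 1, data = dataframe)
  cooksd <- cooks.distance(mod)
  # row numbers whose Cook's distance exceeds 4 times the mean
  influential <- as.numeric(names(cooksd)[(cooksd > 4 * mean(cooksd, na.rm = TRUE))])
  # drop those rows; with nothing flagged, dataframe[-influential, ] would return zero rows
  final <- if (length(influential) > 0) dataframe[-influential, ] else dataframe
  return(final)
}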
df1 <- cooks_dist(df1, 4)
> head(df1)
  X Observation Dist_Taxi Dist_Market Dist_Hospital Carpet Builtup Rainfall House_Price
2 2           2      8294        8186         12694   1461    1752      210     3982000
3 3           3     11001       14399         16991   1340    1609      720     5401000
4 4           4      8301       11188         12289   1451    1748      620     5373000
5 5           5     10510       12629         13921   1770    2111      450     4662000
7 7           7     13153       11869         17811   1542    1858     1030     7224000
8 8           8      5882        9948         13315   1261    1507     1020     3772000
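If the aim is to keep the question's full model (the chosen column regressed on every other numeric column) and to pass the column by name, the function could be adapted roughly as follows. This is only a sketch, not the answerer's code: the name cooks_dist_by_name and the threshold argument are made up for illustration, it assumes the target column is numeric and that the data have no missing values, and it is shown on the built-in mtcars data rather than the CSV from the question.

```r
# Hypothetical variant: `column` is a column name given as a string.
cooks_dist_by_name <- function(dataframe, column, threshold = 4) {
  # Keep only numeric columns; drop = FALSE keeps a data frame
  # even if a single column remains.
  numeric_df <- dataframe[, sapply(dataframe, is.numeric), drop = FALSE]
  stopifnot(column %in% names(numeric_df))

  # Build the formula `<column> ~ .` from the string.
  mod <- lm(reformulate(".", response = column), data = numeric_df)
  cooksd <- cooks.distance(mod)

  # Flag rows above threshold * mean. A logical index stays safe
  # even when no row is flagged.
  influential <- !is.na(cooksd) & cooksd > threshold * mean(cooksd, na.rm = TRUE)
  numeric_df[!influential, , drop = FALSE]
}

# Usage on a built-in data set (all columns of mtcars are numeric).
clean <- cooks_dist_by_name(mtcars, "mpg")
```

Filtering with a logical vector instead of negative row numbers also means the function does not depend on the row names being the default 1..n.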
Answer 1: (score: 0)
I used this code, with a Cook's distance threshold of 4/n:
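# fit the original model and store each row's Cook's distance
orig.mod <- lm(Outcome ~ Exposure, data=origdf)
origdf$cooksd <- cooks.distance(orig.mod)
# keep rows below the 4/n threshold (n = number of rows in origdf)
origdf$cookyn <- ifelse(origdf$cooksd < 4/nrow(origdf), "keep","no")
minus.df <- subset(origdf, cookyn=="keep")
# refit the model without the influential rows
newmod.minuscooks <- lm(Outcome ~ Exposure, data=minus.df)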
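The same 4/n rule can be wrapped in a small reusable function. The sketch below is an illustration under assumptions, not part of the answer: refit_without_influential is a made-up name, it assumes the model variables contain no missing values (otherwise lm() drops rows and the distances no longer line up with the data), and it uses the built-in cars data in place of origdf.

```r
# Hypothetical helper: drop rows whose Cook's distance is 4/n or more,
# then refit the same formula on the remaining rows.
refit_without_influential <- function(formula, data) {
  mod <- lm(formula, data = data)
  cooksd <- cooks.distance(mod)

  # Stops if lm() silently dropped rows because of missing values.
  stopifnot(length(cooksd) == nrow(data))

  # Rows with an undefined (NA/NaN) distance are treated as not kept.
  keep <- !is.na(cooksd) & cooksd < 4 / nrow(data)
  kept <- data[keep, , drop = FALSE]

  list(data = kept, model = lm(formula, data = kept))
}

# Usage on a built-in data set.
result <- refit_without_influential(dist ~ speed, cars)
```

Note that the refitted model has its own Cook's distances, so some of the remaining rows may exceed 4/n again; the helper applies the rule only once.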