How to remove outliers from a data frame in R

Time: 2018-02-24 16:01:27

Tags: r

I am writing a generic function in R that takes a data frame and a column name and returns a clean data frame with the outliers removed.

cooks_dist <- function(dataframe,column){
  dataframe <- dataframe %>%  select_if(dataframe,is.numeric)
  mod <- lm(column ~ ., data=dataframe)
  cooksd <- cooks.distance(mod)

  influential <- as.numeric(names(cooksd)[(cooksd > 4*mean(cooksd,na.rm=T))])  # influential row numbers

  final <- dataframe[-influential,]

  return(final)

}

However, when I run this function, it fails with Error: Can't convert a list to function

The data can be found at:

http://ucanalytics.com/blogs/wp-content/uploads/2016/09/Regression-Clean-Data.csv

2 answers:

Answer 0: (score: 2)

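The error comes from dplyr::select_if(). I believe you want the subset of all numeric columns, so you can build that subset with sapply() instead. Note: because your lm() line also produces an error, I have substituted a minimal model.

So I think you want this: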
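For context, the piped call `dataframe %>% select_if(dataframe, is.numeric)` passes the data frame twice, so the second copy lands in the predicate slot, which is consistent with the "Can't convert a list to function" message. The base R subsetting mentioned above avoids the pipe altogether. Here is a minimal sketch on a made-up data frame; the name `toy` and its columns are assumptions for illustration only, not the asker's data:

```r
# Toy data frame (hypothetical) with mixed column types
toy <- data.frame(
  a = c(1.5, 2.5, 3.5),
  b = c("x", "y", "z"),
  c = c(10L, 20L, 30L),
  stringsAsFactors = FALSE
)

# Named logical vector: TRUE for each numeric column
is_num <- sapply(toy, is.numeric)

# drop = FALSE keeps a data frame even if only one column is numeric
numeric_only <- toy[, is_num, drop = FALSE]

print(names(numeric_only))
```

The `drop = FALSE` argument is an addition of this sketch; without it, a data frame with a single numeric column would collapse to a plain vector.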

cooks_dist <- function(dataframe, column){
  # Keep only the numeric columns
  dataframe <- dataframe[, sapply(dataframe, is.numeric)]
  # Minimal (intercept-only) model for the chosen column
  mod <- lm(dataframe[, column] ~ 1, data = dataframe)
  cooksd <- cooks.distance(mod)
  # Row numbers whose Cook's distance exceeds 4 times the mean
  influential <- as.numeric(names(cooksd)[(cooksd > 4 * mean(cooksd, na.rm = TRUE))])
  final <- dataframe[-influential, ]
  return(final)
}

df1 <- cooks_dist(df1, 4)

Which yields:

> head(df1)
  X Observation Dist_Taxi Dist_Market Dist_Hospital Carpet Builtup Rainfall House_Price
2 2           2      8294        8186         12694   1461    1752      210     3982000
3 3           3     11001       14399         16991   1340    1609      720     5401000
4 4           4      8301       11188         12289   1451    1748      620     5373000
5 5           5     10510       12629         13921   1770    2111      450     4662000
7 7           7     13153       11869         17811   1542    1858     1030     7224000
8 8           8      5882        9948         13315   1261    1507     1020     3772000
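The function above takes a column position and fits an intercept-only model, whereas the question asked for a column name and a model on all remaining numeric columns. The following is a rough sketch of that variant, not the answerer's code. The name `cooks_dist_by_name` and the toy data are hypothetical, and the sketch assumes that the column name is syntactically valid, that at least one other numeric column exists, and that there are no missing values. It also uses logical indexing, because `dataframe[-influential, ]` returns zero rows when `influential` is empty.

```r
# Hypothetical variant: `column` is the response column's name (a string)
cooks_dist_by_name <- function(dataframe, column) {
  # Keep only numeric columns; drop = FALSE preserves the data frame
  dataframe <- dataframe[, sapply(dataframe, is.numeric), drop = FALSE]
  # Build e.g. y ~ . from the name; assumes a syntactically valid name
  mod <- lm(as.formula(paste(column, "~ .")), data = dataframe)
  cooksd <- cooks.distance(mod)
  # Same 4 * mean cutoff as in the question
  keep <- cooksd <= 4 * mean(cooksd, na.rm = TRUE)
  # Logical indexing returns all rows when nothing is flagged
  dataframe[keep, , drop = FALSE]
}

# Toy data: y = 2 * x except for one gross outlier in row 10
toy <- data.frame(x = 1:20, y = 2 * (1:20), label = letters[1:20])
toy$y[10] <- 200

cleaned <- cooks_dist_by_name(toy, "y")
print(nrow(cleaned))  # row 10 is dropped, leaving 19 rows
```

Building the formula from a string is one of several ways to pass a column name into lm(); it is used here only because it needs nothing beyond base R.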

Answer 1: (score: 0)

I used this code, with a Cook's distance threshold of 4/n:

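# Fit the original model
orig.mod <- lm(Outcome ~ Exposure, data = origdf)

# Cook's distance for every observation
origdf$cooksd <- cooks.distance(orig.mod)

# Keep observations below the 4 / n threshold
origdf$cookyn <- ifelse(origdf$cooksd < 4 / nrow(origdf), "keep", "no")

minus.df <- subset(origdf, cookyn == "keep")

# Refit without the influential observations
newmod.minuscooks <- lm(Outcome ~ Exposure, data = minus.df)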
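To try the 4/n rule without the original data, here is a small sketch that uses R's built-in `cars` dataset as a stand-in for origdf (an assumption made purely for illustration) and counts how many points each of the two cutoffs in this thread would flag. No particular counts are claimed; run it to see them.

```r
# Built-in example data standing in for origdf (an assumption)
demo_df <- cars
demo_mod <- lm(dist ~ speed, data = demo_df)
cooksd <- cooks.distance(demo_mod)

# Rule from this answer: flag points with Cook's distance >= 4 / n
flag_4n <- cooksd >= 4 / nrow(demo_df)
# Rule from the question: flag points above 4 times the mean distance
flag_4mean <- cooksd > 4 * mean(cooksd, na.rm = TRUE)

# Number of rows each rule would remove
print(c(four_over_n = sum(flag_4n), four_times_mean = sum(flag_4mean)))

# Refit on the rows kept by the 4 / n rule
minus_df <- demo_df[!flag_4n, ]
new_mod <- lm(dist ~ speed, data = minus_df)
```

Both cutoffs are rules of thumb, so it may be worth inspecting the flagged rows before dropping them.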