Using the which() function inside filter() with dplyr

Date: 2019-01-20 18:47:00

Tags: r dplyr filtering which

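I am trying to filter a dataset and then set its outliers to the mean. I filter the outliers using dplyr, then try to mutate the TEAM_FIELDING_E column based on the corrected (non-outlier) mean. Sample data frame: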

structure(list(INDEX = c(1, 2, 3, 4, 5, 6), TARGET_WINS = c(39, 
70, 86, 70, 82, 75), TEAM_BATTING_H = c(1445, 1339, 1377, 1387, 
1297, 1279), TEAM_BATTING_2B = c(194, 219, 232, 209, 186, 200
), TEAM_BATTING_3B = c(39, 22, 35, 38, 27, 36), TEAM_BATTING_HR = c(13, 
190, 137, 96, 102, 92), TEAM_BATTING_BB = c(143, 685, 602, 451, 
472, 443), TEAM_BATTING_SO = c(842, 1075, 917, 922, 920, 973), 
    TEAM_BASERUN_SB = c(NA, 37, 46, 43, 49, 107), TEAM_BASERUN_CS = c(NA, 
    28, 27, 30, 39, 59), TEAM_BATTING_HBP = c(NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_), TEAM_PITCHING_H = c(9364, 
    1347, 1377, 1396, 1297, 1279), TEAM_PITCHING_HR = c(84, 191, 
    137, 97, 102, 92), TEAM_PITCHING_BB = c(927, 689, 602, 454, 
    472, 443), TEAM_PITCHING_SO = c(5456, 1082, 917, 928, 920, 
    973), TEAM_FIELDING_E = c(1011, 193, 175, 164, 138, 123), 
    TEAM_FIELDING_DP = c(NA, 155, 153, 156, 168, 149)), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

This is the code I tried:

train %>%
  filter(which(boxplot.stats(train$TEAM_FIELDING_E)$out %in% train$TEAM_FIELDING_E, arr.ind = TRUE) == TRUE) %>%
  mutate(
    TEAM_FIELDING_E = NA,
    TEAM_FIELDING_E = mean(train$TEAM_FIELDING_E)
  )

It returns the following error (the original dataset contains 303 outliers and 2276 rows):

Error in filter_impl(.data, quo) : Result must have length 2276, not 303

How can I use dplyr so that my mutate() affects only the filtered rows?

1 answer:

Answer 0 (score: 1)

In dplyr verbs, use bare variable names rather than $ or [[. Also, if you are filtering on a value, you can filter on that value directly instead of trying to use which() to determine the matching positions.
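As a minimal sketch of that advice: the toy tibble below is hypothetical and reuses only the TEAM_FIELDING_E values from the sample data in the question. The condition passed to filter() is written with bare column names and yields one TRUE/FALSE per row, which is the length filter() expects. A which() call returns positions instead, which is why the lengths did not match.

```r
library(dplyr)

# Hypothetical stand-in for the asker's `train` data (values from the sample above).
train <- tibble(TEAM_FIELDING_E = c(1011, 193, 175, 164, 138, 123))

# Keep only the rows flagged as outliers by boxplot.stats().
# The %in% test returns one logical value per row.
outlier_rows <- train %>%
  filter(TEAM_FIELDING_E %in% boxplot.stats(TEAM_FIELDING_E)$out)

# Negate the same condition to keep only the non-outlier rows.
non_outlier_rows <- train %>%
  filter(!TEAM_FIELDING_E %in% boxplot.stats(TEAM_FIELDING_E)$out)
```

On these six values, only 1011 falls outside the boxplot whiskers, so outlier_rows has one row and non_outlier_rows has five.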

In this case, you can get what you want by using if_else() inside mutate().
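The answer's original code did not survive on this page, so the following is only a sketch of how if_else() inside mutate() could be applied here, not the answerer's exact code. The toy tibble and the helper column is_outlier are assumptions for illustration. Because nothing is filtered out, all rows are kept and only the outlier values change.

```r
library(dplyr)

# Hypothetical stand-in for the asker's `train` data (values from the sample above).
train <- tibble(
  INDEX = 1:6,
  TEAM_FIELDING_E = c(1011, 193, 175, 164, 138, 123)
)

# Values flagged as outliers by the boxplot rule.
outliers <- boxplot.stats(train$TEAM_FIELDING_E)$out

fixed <- train %>%
  mutate(
    # Assumed helper column: TRUE where the value is a flagged outlier.
    is_outlier = TEAM_FIELDING_E %in% outliers,
    # Replace outliers with the mean of the non-outlier values; keep the rest.
    TEAM_FIELDING_E = if_else(
      is_outlier,
      mean(TEAM_FIELDING_E[!is_outlier]),
      TEAM_FIELDING_E
    )
  )
```

On the six sample rows, only the first value (1011) is flagged, and it is replaced by 158.6, the mean of the other five values.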