Get the names of columns with zero variance using dplyr

Time: 2018-01-24 14:49:38

Tags: r dplyr lapply

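I'm trying to find any variables in my data that have zero variance (i.e. constant continuous variables). I figured out how to do it with lapply, but I would like to use dplyr because I'm trying to follow tidy data principles. I can create a vector containing just the variances with dplyr, but the last step, where I find the values not equal to zero and return the variable names, is what confuses me.

Here is the code: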

library(PReMiuM)
library(tidyverse)
#> ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
#> ✔ tibble  1.4.2     ✔ dplyr   0.7.4
#> ✔ tidyr   0.7.2     ✔ stringr 1.2.0
#> ✔ readr   1.2.0     ✔ forcats 0.2.0
#> ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()


setwd("~/Stapleton_Lab/Projects/Premium/hybridAnalysis/")

# read in data from analysis script
df <- read_csv("./hybrid.csv")
#> Parsed with column specification:
#> cols(
#>   .default = col_double(),
#>   Exp = col_character(),
#>   Pedi = col_character(),
#>   Harvest = col_character()
#> )
#> See spec(...) for full column specifications.

# checking for missing variable
# df %>% 
#     select_if(function(x) any(is.na(x))) %>% 
#     summarise_all(funs(sum(is.na(.))))


# grab month for analysis
may <- df %>% 
    filter(Month==5)
june <- df %>% 
    filter(Month==6)
july <- df %>% 
    filter(Month==7)
aug <- df %>% 
    filter(Month==8)
sept <- df %>% 
    filter(Month==9)
oct <- df %>% 
    filter(Month==10)

# check for zero variance in continuous covariates
numericVars <- grep("Min|Max",names(june))

zero <- which(lapply(june[numericVars],var)==0,useNames = TRUE)

noVar <- june %>% 
    select(numericVars) %>% 
    summarise_all(var) %>% 
    filter_if(all, all_vars(. != 0))
#> Warning in .p(.tbl[[vars[[i]]]], ...): coercing argument of type 'double'
#> to logical

#> (the same warning is repeated 19 more times)

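4 Answers:

Answer 0 (score: 4)

With a reproducible example, I think what you are aiming for is below. Note that, as Colin pointed out, I have not dealt with the issue of selecting variables through a variable that holds them. See his answer for details on that.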

# reproducible data
mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$qsec <- 7

library(dplyr)

mtcars2 %>% 
  summarise_all(var) %>% 
  select_if(function(.) . == 0) %>% 
  names()
# [1] "mpg"  "qsec"

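Personally, I think that obscures what you are doing. My preference would be one of the two options below, which use the purrr package (if you wish to stay in the tidyverse), accompanied by a well-written comment.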

library(purrr)

# Return a character vector of variable names which have 0 variance
names(mtcars2)[which(map_dbl(mtcars2, var) == 0)]
names(mtcars2)[map_lgl(mtcars2, function(x) var(x) == 0)]

If you want to optimize for speed, stick with base R:

# Return a character vector of variable names which have 0 variance
names(mtcars2)[vapply(mtcars2, function(x) var(x) == 0, logical(1))]
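
The one-liners above assume every column is numeric and has no missing values. A minimal base R sketch that relaxes both assumptions might look like the following; zero_var_names is a hypothetical helper name, not part of any package, and the extra label column and the NA are added only for illustration.

```r
# Hypothetical helper (not from the answer above): names of numeric columns
# whose variance is exactly zero, ignoring missing values.
zero_var_names <- function(data) {
  is_num <- vapply(data, is.numeric, logical(1))
  is_zero <- vapply(
    data[is_num],
    # isTRUE() turns an NA variance (e.g. an all-NA column) into FALSE
    function(x) isTRUE(var(x, na.rm = TRUE) == 0),
    logical(1)
  )
  names(data)[is_num][is_zero]
}

mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$qsec <- 7
mtcars2$label <- "a"   # non-numeric column, skipped by the helper
mtcars2$hp[1] <- NA    # missing value, ignored via na.rm = TRUE

zero_var_names(mtcars2)
# [1] "mpg"  "qsec"
```

Whether missing values should be ignored or should disqualify a column depends on the analysis; na.rm = TRUE is just one possible choice.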

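Answer 1 (score: 1)

You have two problems.

1. Passing column names stored in a variable to select()

The vignette on this is "Programming with dplyr". The solution here is to use select_at(), the scoped variant of the select function.

2. Variance equal to 0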

noVar <- june %>% 
    select_at(.vars=numericVars) %>% 
    summarise_all(.funs=var) %>%
    filter_all(any_vars(. == 0))
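
The pipeline above returns a data frame of variances rather than column names. As a hedged sketch of one way to go from stored column positions to the names of the zero-variance columns, the example below uses mtcars as a stand-in for june (the hybrid.csv data is not available here), and the grep() pattern is purely illustrative. It assumes dplyr 1.0.0 or later, where all_of() and where() can be used inside select().

```r
library(dplyr)

# Stand-in data (assumption): mtcars with two columns made constant
mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$qsec <- 7

# Illustrative stand-in for numericVars in the question:
# grep() returns integer column positions, not names
numericVars <- grep("mpg|hp|qsec", names(mtcars2))

zeroVarNames <- mtcars2 %>%
  select(all_of(numericVars)) %>%   # select by the stored positions
  select(where(~ var(.x) == 0)) %>% # keep only columns whose variance is 0
  names()

zeroVarNames
# [1] "mpg"  "qsec"
```

all_of() makes it explicit that numericVars is an external vector rather than a column of the data, which is the ambiguity the first problem describes.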

Answer 2 (score: 1)

Select the columns whose count of unique values is 1, then get the column names, using @Benjamin's example data mtcars2:

mtcars2 %>% 
  select_if(function(.) n_distinct(.) == 1) %>% 
  names()
# [1] "mpg"  "qsec"
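
One difference from the variance-based approaches is worth illustrating: n_distinct() also works on non-numeric columns, and it counts NA as a distinct value unless na.rm = TRUE is set. The sketch below uses a small made-up data frame (toy is an illustrative name, not data from the question).

```r
library(dplyr)

x <- c(7, 7, NA)
n_distinct(x)                # 2: NA is counted as its own value
n_distinct(x, na.rm = TRUE)  # 1
var(x)                       # NA
var(x, na.rm = TRUE)         # 0

# Made-up example data: one constant numeric column with an NA,
# one varying column, one constant character column
toy <- data.frame(a = c(7, 7, NA), b = c(1, 2, 3), grp = c("x", "x", "x"))

constant_cols <- toy %>%
  select_if(function(col) n_distinct(col, na.rm = TRUE) == 1) %>%
  names()

constant_cols
# [1] "a"   "grp"
```

If only continuous covariates are of interest, the predicate could additionally check is.numeric(col).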

Answer 3 (score: 0)

The answers here are all good, but since dplyr 1.0.0 superseded the scoped variants (e.g. select_if, select_at, filter_all), here is an update using the same reprex data provided by @Benjamin:

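library(dplyr)
library(purrr)

mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$qsec <- 7

# compute the variance of every column, then keep the columns where it is 0
mtcars2 %>% 
  map_df( ~ var(.)) %>% 
  select(where( ~ . == 0))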

which gives

# A tibble: 1 x 2
    mpg  qsec
  <dbl> <dbl>
1     0     0

And after appending %>% names():

[1] "mpg"  "qsec"
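
The same two-step idea can also be sketched without purrr, using across() inside summarise(). This is an illustrative variant rather than part of the answer above; it assumes dplyr 1.0.0 or later and restricts the computation to numeric columns with where(is.numeric).

```r
library(dplyr)

mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$qsec <- 7

zero_var_cols <- mtcars2 %>%
  summarise(across(where(is.numeric), var)) %>%  # one row of variances
  select(where(~ .x == 0)) %>%                   # columns whose variance is 0
  names()

zero_var_cols
# [1] "mpg"  "qsec"
```

If a column contains missing values, var() returns NA and the where() predicate fails, so passing na.rm = TRUE to var() (or wrapping the comparison in isTRUE()) may be needed on real data.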