Question

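I am trying to find any variables in my data with zero variance (i.e. constant continuous variables). I figured out how to do it with lapply, but I would like to use dplyr because I am trying to follow tidy data principles. I can create a vector of just the variances using dplyr, but the last step, where I find the values not equal to zero and return the variable names, is what confuses me.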

Here is the code:

library(PReMiuM)
library(tidyverse)
#> ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
#> ✔ tibble  1.4.2     ✔ dplyr   0.7.4
#> ✔ tidyr   0.7.2     ✔ stringr 1.2.0
#> ✔ readr   1.2.0     ✔ forcats 0.2.0
#> ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()


setwd("~/Stapleton_Lab/Projects/Premium/hybridAnalysis/")

# read in data from analysis script
df <- read_csv("./hybrid.csv")
#> Parsed with column specification:
#> cols(
#>   .default = col_double(),
#>   Exp = col_character(),
#>   Pedi = col_character(),
#>   Harvest = col_character()
#> )
#> See spec(...) for full column specifications.

# checking for missing variable
# df %>% 
#     select_if(function(x) any(is.na(x))) %>% 
#     summarise_all(funs(sum(is.na(.))))


# grab month for analysis
may <- df %>% 
    filter(Month==5)
june <- df %>% 
    filter(Month==6)
july <- df %>% 
    filter(Month==7)
aug <- df %>% 
    filter(Month==8)
sept <- df %>% 
    filter(Month==9)
oct <- df %>% 
    filter(Month==10)

# check for zero variance in continuous covariates
numericVars <- grep("Min|Max",names(june))

zero <- which(lapply(june[numericVars],var)==0,useNames = TRUE)

noVar <- june %>% 
    select(numericVars) %>% 
    summarise_all(var) %>% 
    filter_if(all, all_vars(. != 0))
#> Warning in .p(.tbl[[vars[[i]]]], ...): coercing argument of type 'double'
#> to logical
# (the same warning is repeated 19 more times)

Answer 1

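With a reproducible example, I think what you are aiming for is below. Note that, as Colin pointed out, I have not dealt with the issue of you selecting variables with a character variable. See his answer for details on that.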

# reproducible data
mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$qsec <- 7

library(dplyr)

mtcars2 %>% 
  summarise_all(var) %>% 
  select_if(function(.) . == 0) %>% 
  names()
# [1] "mpg"  "qsec"

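Personally, I think that obscures what you are doing. One of the following, using the purrr package (if you wish to stay in the tidyverse), would be my preference, together with a well-written comment.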

library(purrr)

# Return a character vector of variable names which have 0 variance
names(mtcars2)[which(map_dbl(mtcars2, var) == 0)]
names(mtcars2)[map_lgl(mtcars2, function(x) var(x) == 0)]

If you want to optimize for speed, stick with base R:

# Return a character vector of variable names which have 0 variance
names(mtcars2)[vapply(mtcars2, function(x) var(x) == 0, logical(1))]
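The one-liners above assume every column is numeric and has no missing values. The question's data also contains character columns (Exp, Pedi, Harvest), so a slightly more defensive base R variant might look like the sketch below. The helper name `zero_variance_cols` and its `tol` argument are hypothetical, introduced only for illustration, and `mtcars2` again stands in for the real data.

```r
# Hypothetical helper (not part of any package): names of the numeric columns
# whose variance is zero (or at most `tol`), ignoring missing values.
zero_variance_cols <- function(data, tol = 0) {
  # Keep only numeric columns; var() is not meaningful for character columns
  is_num <- vapply(data, is.numeric, logical(1))
  # na.rm = TRUE is passed through vapply() to var()
  variances <- vapply(data[is_num], var, numeric(1), na.rm = TRUE)
  # var() can return NA (for example when fewer than two non-missing values
  # remain); such columns are not reported as constant
  names(variances)[!is.na(variances) & variances <= tol]
}

# reproducible data, as above, plus a character column and a missing value
mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$qsec <- 7
mtcars2$label <- "a"   # non-numeric column, skipped
mtcars2$hp[1] <- NA    # missing value, ignored by na.rm = TRUE

zero_variance_cols(mtcars2)
```

Setting `tol` to a small positive value is one way to guard against floating-point rounding if an exact comparison with 0 turns out to be too strict for your data.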

Answer 2

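You have two problems.

1. Passing the names of columns as a variable to `select()`

The vignette about this is "Programming with dplyr". The solution here is to use `select_at()`, the scoped variant of the select function.

2. Variance equal to 0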

noVar <- june %>% 
    select_at(.vars=numericVars) %>% 
    summarise_all(.funs=var) %>%
    filter_all(any_vars(. == 0))
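With dplyr 1.0.0 or later, the same two steps can probably be written without the scoped variants. The following is only a sketch under assumptions: it uses `mtcars2` as a stand-in for `june`, and `vars_to_check` is a hypothetical name playing the role of `numericVars`. `all_of()` selects the columns listed in a variable, `across()` applies `var` to each of them, and `where()` keeps the columns of the one-row summary whose value is 0.

```r
library(dplyr)

# Stand-in for the question's data (assumption: the real `june` tibble is
# not available here)
mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$qsec <- 7

# Column names held in a variable, analogous to `numericVars`
vars_to_check <- grep("mpg|qsec|hp", names(mtcars2), value = TRUE)

zero_var_names <- mtcars2 %>%
  select(all_of(vars_to_check)) %>%       # problem 1: select from a variable
  summarise(across(everything(), var)) %>% # one row holding each variance
  select(where(~ .x == 0)) %>%            # problem 2: keep zero-variance columns
  names()

zero_var_names
```

Ending the pipeline with `names()` returns the column names rather than a one-row tibble, which is what the question asks for.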

Answer 3

Select the columns whose count of unique values is 1, then get the column names, using @Benjamin's example data mtcars2:

mtcars2 %>% 
  select_if(function(.) n_distinct(.) == 1) %>% 
  names()
# [1] "mpg"  "qsec"
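For readers on dplyr 1.0.0 or later, the same idea can be sketched with `select(where(...))` in place of `select_if()`. This is an assumed equivalent, not something the answer itself provides; `constant_cols` is a name chosen for illustration.

```r
library(dplyr)

# Same example data as in the other answers
mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$qsec <- 7

# Keep the columns that hold a single distinct value, then return their names
constant_cols <- mtcars2 %>%
  select(where(~ n_distinct(.x) == 1)) %>%
  names()

constant_cols
```

Unlike the variance-based checks, this also works on non-numeric columns. Note that `n_distinct()` counts NA as a distinct value unless `na.rm = TRUE` is supplied.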

Answer 4

The answers here are all good, but since dplyr 1.0.0 superseded the scoped variants (e.g. select_if, select_at, filter_all), here is an update using the same reprex data provided by @Benjamin:

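library(dplyr)
library(purrr)  # for map_df()

mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$qsec <- 7

# compute each column's variance, then keep the columns where it is 0
mtcars2 %>% 
  map_df( ~ var(.)) %>% 
  select(where( ~ . == 0))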

which gives:

# A tibble: 1 x 2
    mpg  qsec
  <dbl> <dbl>
1     0     0

Or, after adding `%>% names()`:

[1] "mpg"  "qsec"

Get column names with zero variance using dplyr