Question

Here is a toy example.

library(dplyr)

iris %>%
  group_by(Species) %>%
  summarise(max = Sepal.Width[Sepal.Length == max(Sepal.Length)])

 # A tibble: 3 x 2
  Species      max
  <fct>      <dbl>
1 setosa       4  
2 versicolor   3.2
3 virginica    3.8

The output is the same when using which().

iris %>% 
  group_by(Species) %>% 
  summarise(max = Sepal.Width[which(Sepal.Length == max(Sepal.Length))])
# summarise(max = Sepal.Width[which.max(Sepal.Length)])

# A tibble: 3 x 2
  Species      max
  <fct>      <dbl>
1 setosa       4  
2 versicolor   3.2
3 virginica    3.8

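help(which) says:

Give the TRUE indices of a logical object, allowing for array indices.

== does something similar: it shows which elements are TRUE and which are FALSE.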
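To make the comparison concrete, here is a minimal sketch on a made-up vector (the names x and is_max are illustrative only). It shows that == returns a logical vector as long as the input, which() returns the integer positions of the TRUE values, and which.max() returns only the first position of the maximum.

```r
x <- c(3, 7, 7, 1)

is_max <- x == max(x)
is_max          # FALSE  TRUE  TRUE FALSE  (one logical value per element)
which(is_max)   # 2 3                      (positions of the TRUE values)
which.max(x)    # 2                        (first maximum only, ties are dropped)

# Without NAs, both forms of indexing select the same elements
x[is_max]         # 7 7
x[which(is_max)]  # 7 7
```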

So when is which() useful for subsetting?

Answer 1

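When "==" ends up with NA. Try (1:2)[which(c(TRUE, NA))] versus (1:2)[c(TRUE, NA)].

If the NA is not removed, indexing by NA gives NA (see ?Extract). However, the removal cannot be done with na.omit, because that shortens the index and can shift TRUE to the wrong position. A safe way is to replace NA with FALSE and then index. But why not just use which?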
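The points above can be sketched on a small made-up vector (x and idx are illustrative names, not from the question). The sketch compares plain logical indexing, the na.omit pitfall, the replace-NA-with-FALSE workaround, and which().

```r
x   <- c(10, 20, 30)
idx <- c(NA, FALSE, TRUE)   # e.g. the result of a comparison involving NA

# Plain logical indexing: the NA in the index yields an NA element
by_logical <- x[idx]            # NA 30

# na.omit() shortens the index to c(FALSE, TRUE), which is then recycled,
# so the TRUE lands on the wrong position
by_na_omit <- x[na.omit(idx)]   # 20

# Safe workaround: turn NA into FALSE before indexing
fixed <- idx
fixed[is.na(fixed)] <- FALSE
by_replace <- x[fixed]          # 30

# which() drops the NA and keeps the correct position
by_which <- x[which(idx)]       # 30
```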

Answer 2

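Since this question is specifically about subsetting, I thought I would illustrate some of the performance benefits of using which() over a logical subset, as brought up in the linked question.

When you want to extract the entire subset, there is not much difference in processing speed, but using which() allocates less memory. However, if you only want a part of the subset (e.g. to showcase some strange findings), which() has a significant speed and memory advantage, because you can subset the result of which() instead of subsetting the data frame twice.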
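As a small illustration of the idea, the sketch below uses the built-in mtcars data (chosen here only to avoid extra dependencies; it is not the data set used in this answer) to check that subsetting the which() result first selects the same rows as subsetting the data frame twice. It says nothing about timing, and it assumes at least 3 rows match.

```r
df   <- mtcars
keep <- df$mpg > mean(df$mpg)   # logical vector, one value per row
i    <- seq_len(3)              # we only want the first 3 matching rows

# Two data frame subsets: all matching rows first, then the first 3 of those
twice <- df[keep, ][i, ]

# One data frame subset: trim the integer positions first, then subset once
once <- df[which(keep)[i], ]

same <- isTRUE(all.equal(twice, once))
same
```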

Here are the benchmarks:

df <- ggplot2::diamonds; dim(df)
#> [1] 53940    10
mu <- mean(df$price)

bench::press(
  n = c(sum(df$price > mu), 10),
  {
    i <- seq_len(n)
    bench::mark(
      logical = df[df$price > mu, ][i, ],
      which_1 = df[which(df$price > mu), ][i, ],
      which_2 = df[which(df$price > mu)[i], ]
    )
  }
)
#> Running with:
#>       n
#> 1 19657
#> 2    10
#> # A tibble: 6 x 11
#>   expression     n      min     mean   median      max `itr/sec` mem_alloc
#>   <chr>      <dbl> <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 logical    19657    1.5ms   1.81ms   1.71ms   3.39ms      553.     5.5MB
#> 2 which_1    19657   1.41ms   1.61ms   1.56ms   2.41ms      620.    2.89MB
#> 3 which_2    19657 826.56us 934.72us 910.88us   1.41ms     1070.    1.76MB
#> 4 logical       10 893.12us   1.06ms   1.02ms   1.93ms      941.    4.21MB
#> 5 which_1       10  814.4us 944.81us 908.16us   1.78ms     1058.    1.69MB
#> 6 which_2       10 230.72us 264.45us 249.28us   1.08ms     3781.  498.34KB
#> # ... with 3 more variables: n_gc <dbl>, n_itr <int>, total_time <bch:tm>

Created on 2018-08-19 by the reprex package (v0.2.0).

Answer 3

which removes the NA elements. If we need the same behavior as which when there are NAs, use another condition along with ==:

iris %>% 
  group_by(Species) %>% 
  summarise(max = Sepal.Width[Sepal.Length == max(Sepal.Length, na.rm = TRUE) & 
                                   !is.na(Sepal.Length)])
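Since iris contains no missing values, the effect is easier to see on a small made-up example (len and wid are hypothetical vectors standing in for Sepal.Length and Sepal.Width). A minimal sketch:

```r
len <- c(5.1, NA, 5.8, 5.8)
wid <- c(3.5, 3.0, 4.0, 2.7)

top <- max(len, na.rm = TRUE)   # 5.8

# == alone: the NA in len propagates into the result
plain <- wid[len == top]                     # NA 4.0 2.7

# == plus !is.na(): NA & FALSE is FALSE, so the NA row is dropped
guarded <- wid[len == top & !is.na(len)]     # 4.0 2.7

# which() gives the same result as the guarded condition
with_which <- wid[which(len == top)]         # 4.0 2.7
```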
