Question

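When working with data frames, it is common to need a subset. However, use of the subset function is discouraged. The trouble with the following code is that the data frame name is repeated twice. If you copy, paste, and munge code, it is easy to accidentally leave the second mention of adf unchanged, which can be a disaster.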

adf=data.frame(a=1:10,b=11:20)
print(adf[which(adf$a>5),])  ##alas, adf mentioned twice
print(with(adf,adf[{a>5},])) ##alas, adf mentioned twice
print(subset(adf,a>5)) ##alas, not supposed to use subset

Is there a way to write the above without mentioning adf twice? Unfortunately, with with() or within(), I cannot seem to access adf as a whole.

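The subset(...) function would make this easy, but its documentation warns against using it:

This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.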
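To make the warning concrete, here is a minimal sketch of one way the non-standard evaluation can go wrong. The global variable b is a hypothetical added for illustration: because it shares its name with a column, subset() silently picks the column, while standard subsetting picks the variable in scope.

```r
# Minimal sketch: how non-standard evaluation in subset() can surprise you.
adf <- data.frame(a = 1:10, b = 11:20)

# Hypothetical: a variable in the calling scope that shares a column's name.
b <- 3

# subset() looks up `b` inside the data frame first,
# so this compares column a against column b.
nse_result <- subset(adf, a > b)

# With standard subsetting, a bare `b` refers to the variable in scope.
std_result <- adf[adf$a > b, ]

nrow(nse_result)  # 0: no row has a greater than column b
nrow(std_result)  # 7: the rows where a is 4 through 10
```

Neither call raises a warning, which is why this kind of name clash is easy to miss in non-interactive code.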

Answer 1

As @akrun mentioned, I would use dplyr's filter function:

require("dplyr")
new <- filter(adf, a > 5)
new

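In practice, I don't find the subsetting notation ([ ]) problematic, because when I copy a block of code I use find and replace in RStudio to replace every mention of the data frame in the selected code. I use dplyr instead because its notation and syntax are easier for new users (and myself!) to follow, and because the dplyr functions do their job well.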

Answer 2

After some thought, I wrote a super simple function called given:

given=function(.,...) { with(.,...) }

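This way, I don't have to repeat the name of the data.frame. I also found it to be about 14 times faster than filter() on this small data frame (comparing mean times). See below: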

library(microbenchmark) ##needed for the timings below
adf=data.frame(a=1:10,b=11:20)
given=function(.,...) { with(.,...) }
with(adf,adf[a>5 & b<18,]) ##adf mentioned twice :(
given(adf,.[a>5 & b<18,]) ##adf mentioned once :)
dplyr::filter(adf,a>5,b<18) ##adf mentioned once...
microbenchmark(with(adf,adf[a>5 & b<18,]),times=1000)
microbenchmark(given(adf,.[a>5 & b<18,]),times=1000)
microbenchmark(dplyr::filter(adf,a>5,b<18),times=1000)

Output, using microbenchmark:

> adf=data.frame(a=1:10,b=11:20)
> given=function(.,...) { with(.,...) }
> with(adf,adf[a>5 & b<18,]) ##adf mentioned twice :(
  a  b
6 6 16
7 7 17
> given(adf,.[a>5 & b<18,]) ##adf mentioned once :)
  a  b
6 6 16
7 7 17
> dplyr::filter(adf,a>5,b<18) ##adf mentioned once...
  a  b
1 6 16
2 7 17
> microbenchmark(with(adf,adf[a>5 & b<18,]),times=1000)
Unit: microseconds
                             expr    min     lq     mean median     uq     max neval
 with(adf, adf[a > 5 & b < 18, ]) 47.897 60.441 67.59776 67.284 70.705 361.507  1000
> microbenchmark(given(adf,.[a>5 & b<18,]),times=1000)
Unit: microseconds
                            expr    min     lq     mean median    uq     max neval
 given(adf, .[a > 5 & b < 18, ]) 48.277 50.558 54.26993 51.698 56.64 272.556  1000
> microbenchmark(dplyr::filter(adf,a>5,b<18),times=1000)
Unit: microseconds
                              expr     min       lq     mean   median       uq      max neval
 dplyr::filter(adf, a > 5, b < 18) 524.965 581.2245 748.1818 674.7375 889.7025 7341.521  1000

I noticed that given() is actually a bit faster than with() here, which I attribute to the length of the variable name.

The neat thing about given is that you can do things inline, without an assignment: given(data.frame(a=1:10,b=11:20),.[a>5 & b<18,])
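For anyone wondering why this works: the first argument of given is literally named ".", so inside the with() call the whole data frame is reachable as "." while the columns are reachable by name. The sketch below checks that the result matches plain [ subsetting. It also shows a caveat that is my own observation rather than part of the answer above: since given relies on with(), column names still take priority over variables in the calling scope, much like subset(). The global b is a hypothetical added for illustration.

```r
adf <- data.frame(a = 1:10, b = 11:20)

# `.` is an ordinary argument name; with() evaluates the expression
# with the columns of `.` visible, and `.` itself visible from given's frame.
given <- function(., ...) { with(., ...) }

res_given <- given(adf, .[a > 5 & b < 18, ])
res_plain <- adf[adf$a > 5 & adf$b < 18, ]

# Same rows, same row names (6 and 7), same columns.
same_result <- identical(res_given, res_plain)

# Caveat (assumption for illustration): a variable named like a column.
b <- 3
# Inside given(), `b` resolves to the column, not to the variable above,
# so no row satisfies a > b.
rows_with_clash <- nrow(given(adf, .[a > b, ]))
```

So given removes the repeated data frame name, but it does not remove the name-lookup behaviour that the subset() documentation warns about.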

In R, subsetting without using subset() and using [ in a more concise manner to prevent typos?
