Question

Consider the following data:

d = data.frame(
    experiment = as.factor(c("foo", "foo", "foo", "bar", "bar")),
    si = runif(5),
    ti = runif(5)
)

I'd like to run a correlation test between si and ti for each level of the experiment factor. So I thought I would run:

ddply(d, .(experiment), cor.test)

But how do I pass the values of si and ti to the cor.test call? I tried this:

> ddply(d, .(experiment), cor.test, x = si, y = ti)
Error in .fun(piece, ...) : object 'si' not found
> ddply(d, .(experiment), cor.test, si, ti)
Error in match.arg(alternative) : 
  'arg' must be NULL or a character vector

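Is there something obvious I'm missing? The plyr documentation doesn't include any examples of this. Most of the commands I've seen only use summarize as the function call, but doing the usual things I would do with summarize doesn't work, as shown above.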

Answer 1

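ddply splits the data frame by the variables you choose (here, experiment) and then passes each resulting subset of the data frame to the function. In your case, the function cor.test does not accept a data frame as input, so you need a translation layer: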

library(plyr)  # provides ddply

d <- data.frame(
  experiment = as.factor(c("foo", "foo", "foo", "bar", "bar", "bar")),
  si = runif(6),
  ti = runif(6)
)
ddply(d, .(experiment), function(d.sub) cor.test(d.sub$si, d.sub$ti)$statistic)
#   experiment         t
# 1        bar 0.1517205
# 2        foo 0.3387682

Also, your output must be something like a vector or a data frame, which is why I select only $statistic above, but you can add more variables if you like.

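Side note: I had to add a value to the input data frame, because cor.test won't run on 2 values (which was the case for "bar"). If you want more comprehensive statistics, you can try: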

ddply(d, .(experiment), function(d.sub) {
  as.data.frame(cor.test(d.sub$si, d.sub$ti)[c("statistic", "parameter", "p.value", "estimate")])
} )
#   experiment statistic parameter   p.value  estimate
# 1        bar 0.1517205         1 0.9041428 0.1500039
# 2        foo 0.3387682         1 0.7920584 0.3208567

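Note that since we are now returning something more complex than a vector, we need to coerce it to a data.frame. If you want to include more complex values (such as the confidence interval, which is a two-value result), you have to simplify them first.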
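As a rough sketch of what "simplifying" could look like, the two-value confidence interval can be split into two scalar columns. The data frame d4, its seed, and the column names conf.low and conf.high are made up for illustration; five rows per group are used here because, for the default Pearson test, cor.test only reports a confidence interval when a group has at least 4 complete pairs.

```r
library(plyr)

# Hypothetical example data: 5 observations per experiment so that
# cor.test() can compute a confidence interval (needs at least 4 pairs).
set.seed(1)
d4 <- data.frame(
  experiment = factor(rep(c("foo", "bar"), each = 5)),
  si = runif(10),
  ti = runif(10)
)

res <- ddply(d4, .(experiment), function(d.sub) {
  ct <- cor.test(d.sub$si, d.sub$ti)
  # Flatten the length-2 conf.int into two scalar columns
  data.frame(
    estimate  = unname(ct$estimate),
    p.value   = ct$p.value,
    conf.low  = ct$conf.int[1],
    conf.high = ct$conf.int[2]
  )
})
res
```

Each group then contributes exactly one row, which is the shape ddply expects when it binds the pieces back together.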

Answer 2

If you don't mind running cor.test several times per experiment (i.e. performance is not an issue), you can use summarize.

#note that you need at least 3 value pairs for cor.test
set.seed(42)
d = data.frame(
  experiment = as.factor(c("foo", "foo", "foo", "bar", "bar", "bar")),
  si = runif(6),
  ti = runif(6)
)

library(plyr)
ddply(d, .(experiment), summarize,
      r=cor.test(si, ti)$estimate,
      p=cor.test(si, ti)$p.value
      )

#  experiment           r         p
#1        bar  0.07401492 0.9528375
#2        foo -0.41842834 0.7251622
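If calling cor.test twice per group is a concern, one possible alternative is a small wrapper that runs the test once and receives the column names as extra arguments; ddply forwards any additional arguments to the function it applies. The helper name cor_by_cols and its parameters x and y are invented for this sketch, not part of plyr.

```r
library(plyr)

set.seed(42)
d <- data.frame(
  experiment = as.factor(c("foo", "foo", "foo", "bar", "bar", "bar")),
  si = runif(6),
  ti = runif(6)
)

# Hypothetical helper: x and y are column names given as strings.
cor_by_cols <- function(d.sub, x, y) {
  ct <- cor.test(d.sub[[x]], d.sub[[y]])  # the test runs once per group
  data.frame(r = unname(ct$estimate), p = ct$p.value)
}

# Extra named arguments after the function are passed on to cor_by_cols
res2 <- ddply(d, .(experiment), cor_by_cols, x = "si", y = "ti")
res2
```

Passing the column names as strings avoids the "object 'si' not found" error from the question, because nothing has to be evaluated before the subset reaches the helper.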

How do I pass variables to a custom function in ddply?
