Question

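I'm working on validating a function that calculates the pass rate for a certain standard in my lab. The math behind it is simple: given a number of tests that either pass or fail, what percentage passed?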

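The data will be provided as a column of values that are either P1 (passed on the first test), F1 (failed on the first test), P2, or F2 (passed or failed on the second test, respectively). I've written the function passRate below to help calculate the pass rate overall (first and second attempts combined) as well as for the first test and the second test separately.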
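The passRate function itself is not reproduced in this post. As a minimal base-R sketch of the calculation described above, assuming the input is a character vector of the codes P1, F1, P2 and F2 (the name pass_rate_sketch and the names of the returned elements are hypothetical, not the asker's), the three rates could be computed like this:

```r
# Hypothetical helper, not the asker's passRate: computes pass rates (in %)
# from a character vector of result codes "P1", "F1", "P2", "F2".
pass_rate_sketch <- function(results) {
  # Count each code; factor() keeps a zero count for codes that never occur.
  n <- table(factor(results, levels = c("P1", "F1", "P2", "F2")))
  n <- setNames(as.numeric(n), names(n))
  c(overall = (n[["P1"]] + n[["P2"]]) / sum(n) * 100,
    first   = n[["P1"]] / (n[["P1"]] + n[["F1"]]) * 100,
    second  = n[["P2"]] / (n[["P2"]] + n[["F2"]]) * 100)
}

# Three first-test passes, one first-test failure, one pass and one failure
# on the second test.
pass_rate_sketch(c("P1", "P1", "F1", "P2", "F2", "P1"))
```

Note that when no second tests were run at all, the second-test rate is 0 / 0, which R evaluates to NaN rather than NA.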

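The quality specialist who set up the parameters for the validation gave me a list of pass and fail counts, which I convert to a vector using the test_vector function below.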
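The test_vector function is not shown either. One way such a conversion from counts to a vector of codes might look, as a sketch only (counts_to_codes is a hypothetical name, and the counts used are those of the third row of the test data in this question):

```r
# Hypothetical helper, not the asker's test_vector: expands pass/fail counts
# into a character vector of result codes.
counts_to_codes <- function(P1, F1, P2, F2) {
  rep(c("P1", "F1", "P2", "F2"), times = c(P1, F1, P2, F2))
}

codes <- counts_to_codes(P1 = 10, F1 = 0, P2 = 2, F2 = 0)
table(codes)
```

Codes with a count of zero simply do not appear in the resulting vector, so any function that consumes it has to handle missing codes.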

Everything looked great until I got to the third row of the Pass data frame, which contains the pass/fail counts from my quality specialist. Instead of returning a second-test pass rate of 100%, it returns NA ... but only when I use mutate.

> library(dplyr)
> Pass <- structure(list(P1 = c(2L, 0L, 10L),
+                        F1 = c(0L, 2L, 0L),
+                        P2 = c(0L, 3L, 2L),
+                        F2 = c(0L, 2L, 0L),
+                        id = 1:3),
+                   .Names = c("P1", "F1", "P2", "F2", "id"),
+                   class = c("tbl_df", "data.frame"),
+                   row.names = c(NA, -3L))

So here is something similar to what I did with mutate:

> Pass %>%
+     group_by(id) %>%
+     mutate(pass_rate = (P1 + P2) / (P1 + P2 + F1 + F2) * 100,
+            pass_rate1 = P1 / (P1 + F1) * 100,
+            pass_rate2 = P2 / (P2 + F2) * 100)
Source: local data frame [3 x 8]
Groups: id [3]

     P1    F1    P2    F2    id pass_rate pass_rate1 pass_rate2
  (int) (int) (int) (int) (int)     (dbl)      (dbl)      (dbl)
1     2     0     0     0     1 100.00000        100         NA
2     0     2     3     2     2  42.85714          0         60
3    10     0     3     1     3 100.00000        100         NA

Compare that with what happens when I use summarise:

> Pass %>%
+     group_by(id) %>%
+     summarise(pass_rate = (P1 + P2) / (P1 + P2 + F1 + F2) * 100,
+               pass_rate1 = P1 / (P1 + F1) * 100,
+               pass_rate2 = P2 / (P2 + F2) * 100)
Source: local data frame [3 x 4]

     id pass_rate pass_rate1 pass_rate2
  (int)     (dbl)      (dbl)      (dbl)
1     1 100.00000        100         NA
2     2  42.85714          0         60
3     3 100.00000        100        100

I had expected these to return the same results. My guess is that mutate runs into trouble somewhere because it assumes that every n rows in a group should map to n rows in the result (is it getting confused when counting n here?), whereas summarise knows that no matter how many rows it starts with, it will end with 1.
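Since every id occurs exactly once, each group holds a single row, so the grouped calculation should reduce to plain element-wise arithmetic. A base-R sketch (an illustration that avoids dplyr entirely, not something used in the question) shows the values one would expect for each row:

```r
# Same counts as in the Pass data frame from the question.
Pass <- data.frame(P1 = c(2L, 0L, 10L), F1 = c(0L, 2L, 0L),
                   P2 = c(0L, 3L, 2L), F2 = c(0L, 2L, 0L), id = 1:3)

# Vectorised, row-by-row arithmetic; no grouping involved.
expected <- transform(Pass,
                      pass_rate  = (P1 + P2) / (P1 + P2 + F1 + F2) * 100,
                      pass_rate1 = P1 / (P1 + F1) * 100,
                      pass_rate2 = P2 / (P2 + F2) * 100)
expected
```

In this version the first row's second-test rate is 0 / 0, which R evaluates to NaN, and the third row's second-test rate is 2 / 2 * 100 = 100.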

Does anyone have any thoughts on the mechanics behind this behavior?

Answer 1

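It seems to me that there is some interference between dplyr and plyr. I had the same problem with another unbalanced data set (hence the need for grouping): exactly in the third group, the mutated variable was wrongly NA! Then I reproduced your example at home. First, after

> library("dplyr", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.2")

I got your results. Then I ran my own script, in which the package plyr is loaded. After the warning that plyr should not be loaded after dplyr, the NA in my third group disappeared, and your example was computed correctly as well! Here is what I did (I added one more row to see whether the NA stays in the third group):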

> Pass <- structure(list(P1 = c(2L, 0L, 10L,8L), 
+                        F1 = c(0L, 2L, 0L, 4L), 
+                        P2 = c(0L, 3L, 2L, 2L), 
+                        F2 = c(0L, 2L, 0L, 1L), 
+                        id = 1:4), 
+                   .Names = c("P1", "F1", "P2", "F2", "id"), 
+                   class = c("tbl_df", "data.frame"), 
+                   row.names = c(NA, -4L))
> Pass %>%
+     group_by(id) %>%
+     mutate(pass_rate = (P1 + P2) / (P1 + P2 + F1 + F2) * 100,
+            pass_rate1 = P1 / (P1 + F1) * 100,
+            pass_rate2 = P2 / (P2 + F2) * 100)
Source: local data frame [4 x 8]
Groups: id [4]

     P1    F1    P2    F2    id pass_rate pass_rate1 pass_rate2
  (int) (int) (int) (int) (int)     (dbl)      (dbl)      (dbl)
1     2     0     0     0     1 100.00000  100.00000         NA
2     0     2     3     2     2  42.85714    0.00000   60.00000
3    10     0     2     0     3 100.00000  100.00000         NA
4     8     4     2     1     4  66.66667   66.66667   66.66667

Then I did:

> library("plyr", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.2")
> Pass %>%
+     group_by(id) %>%
+     mutate(pass_rate = (P1 + P2) / (P1 + P2 + F1 + F2) * 100,
+            pass_rate1 = P1 / (P1 + F1) * 100,
+            pass_rate2 = P2 / (P2 + F2) * 100)
Source: local data frame [4 x 8]
Groups: id [4]

     P1    F1    P2    F2    id pass_rate pass_rate1 pass_rate2
  (int) (int) (int) (int) (int)     (dbl)      (dbl)      (dbl)
1     2     0     0     0     1 100.00000  100.00000        NaN
2     0     2     3     2     2  42.85714    0.00000   60.00000
3    10     0     2     0     3 100.00000  100.00000  100.00000
4     8     4     2     1     4  66.66667   66.66667   66.66667

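I know this is not a satisfying answer, because plyr should NOT be loaded after dplyr, but it may help those who need group_by(id). Alternatively, call plyr::mutate() explicitly, so that the result does not depend on the order in which plyr and dplyr are loaded: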

> Pass %>%
+     group_by(id) %>%
+     plyr::mutate(pass_rate = (P1 + P2) / (P1 + P2 + F1 + F2) * 100,
+            pass_rate1 = P1 / (P1 + F1) * 100,
+            pass_rate2 = P2 / (P2 + F2) * 100)
Source: local data frame [4 x 8]
Groups: id [4]

     P1    F1    P2    F2    id pass_rate pass_rate1 pass_rate2
  (int) (int) (int) (int) (int)     (dbl)      (dbl)      (dbl)
1     2     0     0     0     1 100.00000  100.00000        NaN
2     0     2     3     2     2  42.85714    0.00000   60.00000
3    10     0     2     0     3 100.00000  100.00000  100.00000
4     8     4     2     1     4  66.66667   66.66667   66.66667
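A possible explanation for why loading plyr changes the result, offered here as an assumption rather than something established in this thread: once plyr is attached after dplyr, the bare name mutate resolves to plyr::mutate, which evaluates each expression on whole columns and does no per-group processing. The sketch below reports which package a bare mutate currently comes from and applies plyr::mutate to a plain data frame; it assumes plyr is installed and skips that step otherwise.

```r
# Same counts as in the four-row example above, as a plain data frame.
Pass <- data.frame(P1 = c(2L, 0L, 10L, 8L), F1 = c(0L, 2L, 0L, 4L),
                   P2 = c(0L, 3L, 2L, 2L), F2 = c(0L, 2L, 0L, 1L), id = 1:4)

# Report where a bare `mutate` would come from in this session, if it exists.
if (exists("mutate", mode = "function")) {
  print(environmentName(environment(get("mutate", mode = "function"))))
}

# Explicitly namespaced call: works without attaching plyr at all.
if (requireNamespace("plyr", quietly = TRUE)) {
  res <- plyr::mutate(Pass, pass_rate2 = P2 / (P2 + F2) * 100)
  print(res$pass_rate2)
}
```

If the first call prints "plyr", the pipelines above are not exercising dplyr's grouped mutate at all, which would be consistent with the NA disappearing.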

dplyr::mutate gives x / y = NA, summarise gives x / y = a real number
