What am I doing wrong with my dplyr code block?

Question

Given a data frame like this one:

df <- data.frame(list(Group = c("Group1", "Group1", "Group2", "Group2"),
                      A=c("Some text", "Text here too", "Some other text", NA), 
                      B=c(NA, "Some random text", NA, "Random here too")))
> df
   Group               A                B
1 Group1       Some text             <NA>
2 Group1   Text here too Some random text
3 Group2 Some other text             <NA>
4 Group2            <NA>  Random here too

I want to count the non-NA values in columns A and B, and then total those counts per group, to obtain the following data frame:

> df.expected
   Group A_n B_n
1 Group1   2   1
2 Group2   1   1

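Although this is a silly example (the original data frame has many more columns and groups, and getting the result by hand is not that easy), I have not succeeded because I can't operate on factors. I am also afraid that my approach (see below) is too verbose and probably overkill, which makes it a poor fit for my real data frame with many more columns.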

What I have done so far:

# Manually create a new numeric column with numbers.
df$A_n = as.character(df$A)
df$A_n[!is.na(df$A_n)] <- 1
df$A_n = as.numeric(df$A_n)

df$B_n = as.character(df$B)
df$B_n[!is.na(df$B_n)] <- 1
df$B_n = as.numeric(df$B_n)

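This part works fine, but I suspect there is a better, shorter, or semi-automated way to create the new columns and assign values to them. Perhaps it is not even necessary.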

The second part of my code is meant to group the observations by the grouping variable and sum the values of each variable using dplyr:

library(dplyr)  

df2 = df %>% 
      select(Group, A_n, B_n) %>% 
      group_by(Group) %>% 
      summarise_all(sum)

However, I get an unexpected data frame:

> df2
# A tibble: 2 x 3
   Group   A_n   B_n
  <fctr> <dbl> <dbl>
1 Group1     2    NA
2 Group2    NA    NA

Can anyone help me solve this in a better way and/or tell me what I am doing wrong with my dplyr code block?

Answer 1

What am I doing wrong with my dplyr code block?

It is because of the NAs: by default, sum() returns NA as soon as any of its inputs is NA. Try

library(dplyr)  

df2 = df %>% 
      select(Group, A_n, B_n) %>% 
      group_by(Group) %>% 
      summarise_all(sum, na.rm=TRUE)

instead.

Output on my machine:

# A tibble: 2 x 3
   Group   A_n   B_n
  <fctr> <dbl> <dbl>
1 Group1     2     1
2 Group2     1     1
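To see why the original call produced NA, it may help to look at sum() on its own. The snippet below is only a minimal illustration; the vector x is a made-up stand-in for the B_n values of Group1 in the question.

```r
# Hypothetical stand-in for B_n within Group1: one missing value, one 1.
x <- c(NA, 1)

# Without na.rm, a single NA makes the whole total unknown.
total_default <- sum(x)

# With na.rm = TRUE, NAs are dropped before summing.
total_na_rm <- sum(x, na.rm = TRUE)

print(total_default)
print(total_na_rm)
```

summarise_all() forwards any extra arguments to the summary function, which is why adding na.rm=TRUE to the call above is enough.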

I am afraid that my approach ... is too verbose and probably overkill

You can do this instead:

df <- data.frame(list(Group = c("Group1", "Group1", "Group2", "Group2"),
                      A=c("Some text", "Text here too", "Some other text", NA), 
                      B=c(NA, "Some random text", NA, "Random here too")))

library(dplyr)

df2 = df %>% 
    group_by(Group) %>% 
    summarise_all(.funs=function(x) length(na.omit(x)))

Output on my machine:

# A tibble: 2 x 3
   Group     A     B
  <fctr> <int> <int>
1 Group1     2     1
2 Group2     1     1
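If the indicator columns from the question are still wanted for some other purpose, they could probably be built in one step rather than one column at a time. This is a sketch, not the asker's method: the helper vector cols is a name introduced here for illustration, and unlike the code in the question it stores 0 (not NA) for missing entries, so a plain sum() works on the result.

```r
# Same toy data as in the question.
df <- data.frame(
  Group = c("Group1", "Group1", "Group2", "Group2"),
  A = c("Some text", "Text here too", "Some other text", NA),
  B = c(NA, "Some random text", NA, "Random here too")
)

# Hypothetical helper: the columns to flag; extend it for a wider data frame.
cols <- c("A", "B")

# !is.na() yields TRUE/FALSE; as.numeric() turns that into 1/0.
df[paste0(cols, "_n")] <- lapply(df[cols], function(x) as.numeric(!is.na(x)))

print(df)
```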

A bit of explanation

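If you look at help(summarise_all), you will see that its arguments are .tbl, .funs, and ... (we won't worry about the ellipsis for now). So we feed df to group_by() using the pipe %>%, and then feed the result to summarise_all(), again using the pipe %>%. That takes care of the .tbl argument. The .funs argument is how you specify which function(s) should be used to summarise all the non-grouping columns in .tbl. Here, we want to know how many elements of each column are not NA, which we can do (as one approach) by applying length(na.omit(x)) to every non-grouping column x in .tbl.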
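In dplyr 1.0.0 and later, the summarise_all() family is superseded by across(). The following is a sketch of the same count under the assumption that such a version is installed; sum(!is.na(.x)) is swapped in for length(na.omit(x)) and counts the same thing.

```r
library(dplyr)  # assumes dplyr >= 1.0.0, where across() is available

# Same toy data as in the question.
df <- data.frame(
  Group = c("Group1", "Group1", "Group2", "Group2"),
  A = c("Some text", "Text here too", "Some other text", NA),
  B = c(NA, "Some random text", NA, "Random here too")
)

# Inside summarise(), everything() selects all non-grouping columns,
# and the lambda counts the non-NA entries of each one per group.
df2 <- df %>%
  group_by(Group) %>%
  summarise(across(everything(), ~ sum(!is.na(.x))))

print(df2)
```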

My best suggestion for a dplyr resource is Chapter 5 of R for Data Science, a book by Hadley Wickham, who wrote the dplyr package (among many others).

Answer 2

In base R, you can use aggregate with the standard interface (as opposed to the formula interface).

aggregate(cbind(A_n=df$A, B_n=df$B),  df["Group"], function(x) sum(!is.na(x)))
   Group A_n B_n
1 Group1   2   1
2 Group2   1   1

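cbind the variables you want to count and give them names. In the second argument, supply the grouping variable. Then, as the function, sum the indicator of non-missing elements.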
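A self-contained variant of the same idea, passing the columns as a data frame instead of through cbind, might look like the sketch below (the result name res is introduced here for illustration). The standard interface matters in this case because the formula interface of aggregate defaults to na.action = na.omit, which would drop every row containing an NA before the function is applied.

```r
# Same toy data as in the question.
df <- data.frame(
  Group = c("Group1", "Group1", "Group2", "Group2"),
  A = c("Some text", "Text here too", "Some other text", NA),
  B = c(NA, "Some random text", NA, "Random here too")
)

# Standard (non-formula) interface: the columns to summarise,
# the grouping column(s), and a function counting non-NA entries.
res <- aggregate(df[c("A", "B")], by = df["Group"],
                 FUN = function(x) sum(!is.na(x)))

print(res)
```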

How do I sum the non-NA elements of several columns of factor type?
