Counting NA cases in dplyr's summarise

Asked: 2017-08-22 18:05:03

Tags: r dplyr

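I can't figure out what I'm doing wrong when summarising a column that contains both values and NAs. Everywhere I've read says that you can use sum() to add up the cases, and that to count the NA cases you can use sum(is.na(variable)).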

Indeed, I can reproduce this behaviour with a test tibble:

df <- tibble(x = c(rep("a",5), rep("b",5)), y = c(NA, NA, 1, 1, NA, 1, 1, 1, NA, NA))

df %>%
  group_by(x) %>% 
  summarise(one = sum(y, na.rm = T),
            na = sum(is.na(y)))

This gives the expected result:

# A tibble: 2 x 3
      x   one    na
  <chr> <dbl> <int>
1     a     2     3
2     b     3     2

For some reason, I can't reproduce this result with my own data:

mydata <- structure(list(Group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Amphibians", 
"Birds", "Mammals", "Reptiles", "Plants"), class = "factor"), 
    Scenario = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 
    1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("Present", 
    "RCP 4.5", "RCP 8.5"), class = "factor"), year = c(1940, 
    1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 
    1940, 1940, 1940, 1940, 1940, 1940, 1940), random = c("obs", 
    "obs", "obs", "obs", "obs", "obs", "obs", "obs", "obs", "obs", 
    "obs", "obs", "obs", "obs", "obs", "obs", "obs", "obs"), 
    species = c("Allobates fratisenescus", "Allobates fratisenescus", 
    "Allobates fratisenescus", "Allobates juanii", "Allobates juanii", 
    "Allobates juanii", "Allobates kingsburyi", "Allobates kingsburyi", 
    "Allobates kingsburyi", "Adelophryne adiastola", "Adelophryne adiastola", 
    "Adelophryne adiastola", "Adelophryne gutturosa", "Adelophryne gutturosa", 
    "Adelophryne gutturosa", "Adelphobates quinquevittatus", 
    "Adelphobates quinquevittatus", "Adelphobates quinquevittatus"
    ), Endemic = c(1, 1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA)), row.names = c(NA, -18L), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"), vars = "species", indices = list(
    9:11, 12:14, 15:17, 0:2, 3:5, 6:8), group_sizes = c(3L, 3L, 
3L, 3L, 3L, 3L), biggest_group_size = 3L, labels = structure(list(
    species = c("Adelophryne adiastola", "Adelophryne gutturosa", 
    "Adelphobates quinquevittatus", "Allobates fratisenescus", 
    "Allobates juanii", "Allobates kingsburyi")), row.names = c(NA, 
-6L), class = "data.frame", vars = "species", .Names = "species"), .Names = c("Group", 
"Scenario", "year", "random", "species", "Endemic"))

(My data has a few million rows; I have only reproduced part of it here.)

Testsum <- mydata %>% 
  group_by(Group, Scenario, year, random) %>% 
  summarise(All = n(),
            Endemic = sum(Endemic, na.rm = T),
            noEndemic = sum(is.na(Endemic)))

# A tibble: 3 x 7
# Groups:   Group, Scenario, year [?]
       Group Scenario  year random   All Endemic noEndemic
      <fctr>   <fctr> <dbl>  <chr> <int>   <dbl>     <int>
1 Amphibians  Present  1940    obs     6       3         0
2 Amphibians  RCP 4.5  1940    obs     6       3         0
3 Amphibians  RCP 8.5  1940    obs     6       3         0

!!!! I expected noEndemic to be 3 in every case, because 3 of the species have NA in Endemic ...

I double-checked:

Test3$Endemic %>% class
[1] "numeric"

Obviously there is something very silly that I'm not seeing... after several hours of fiddling. Is it obvious to any of you? Thanks!!!

1 Answer:

Answer 0 (score: 4)

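The reason for this behaviour is that Endemic is being assigned as the name of a new summary variable. summarise() evaluates its expressions in order, so by the time sum(is.na(Endemic)) runs, Endemic no longer refers to the original column but to the sum that was just computed, which contains no NA. Instead, we should use a new column name for the sum and rename it afterwards: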

mydata %>%
     group_by(Group, Scenario, year, random) %>%
     summarise(All = n(),
               EndemicS = sum(Endemic, na.rm = TRUE),
               noEndemic = sum(is.na(Endemic))) %>%
     rename(Endemic = EndemicS) 
# A tibble: 3 x 7
# Groups:   Group, Scenario, year [3]
#       Group Scenario  year random   All Endemic noEndemic
#      <fctr>   <fctr> <dbl>  <chr> <int>   <dbl>     <int>
#1 Amphibians  Present  1940    obs     6       3         3
#2 Amphibians  RCP 4.5  1940    obs     6       3         3
#3 Amphibians  RCP 8.5  1940    obs     6       3         3
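
The masking can also be seen on the small test tibble from the question. The following is only a minimal sketch, assuming dplyr is attached; the names masked and reordered are made up for illustration. It also shows one possible alternative to renaming: counting the NAs before the column name is reused.

```r
library(dplyr)

# Same toy data as in the question
df <- tibble(x = c(rep("a", 5), rep("b", 5)),
             y = c(NA, NA, 1, 1, NA, 1, 1, 1, NA, NA))

# Reusing the name `y` hides the original column from later expressions,
# so is.na() is applied to the per-group sum, which contains no NA here
masked <- df %>%
  group_by(x) %>%
  summarise(y = sum(y, na.rm = TRUE),
            na = sum(is.na(y)))

# Counting the NAs first, while `y` still refers to the original column
reordered <- df %>%
  group_by(x) %>%
  summarise(na = sum(is.na(y)),
            y = sum(y, na.rm = TRUE))

masked$na      # 0 0
reordered$na   # 3 2
```

Reordering avoids the extra rename() step, but it depends on the order of the arguments, so using a separate column name as above is arguably the more robust choice.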