sum() with dplyr and aggregate: NA values

Time: 2017-10-14 18:29:32

Tags: r sum dplyr aggregate aggregate-functions

I have a dataset of about 3,000 rows. The data can be accessed at https://pastebin.com/i4dYCUQX

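Problem: the output contains NA, although there appear to be no NAs in the data. Here is what happens when I try to sum a column within each category using dplyr or aggregate: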

library(dplyr)

example <- read.csv("https://pastebin.com/raw/i4dYCUQX", header=TRUE, sep=",")
example

# dplyr
example %>% group_by(size) %>% summarize_at(vars(volume), funs(sum))

Out:
# A tibble: 4 x 2
         size    volume
       <fctr>     <int>
1 Extra Large        NA
2       Large        NA
3      Medium 937581572
4       Small        NA

# aggregate
aggregate(volume ~ size, data=example, FUN=sum)

Out:
         size    volume
1 Extra Large        NA
2       Large        NA
3      Medium 937581572
4       Small        NA

When I compute the total through colSums instead, it seems to work:

# Colsums
small <- example %>% filter(size == "Small")
colSums(small["volume"], na.rm = FALSE, dims = 1)

Out:
volume 
3869267348 

Can anyone see what the problem might be?

2 Answers:

Answer 0: (score: 1)

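Because the values are stored as integer rather than numeric (double). The totals for three of the groups exceed the largest value an R integer can hold (2147483647), so sum() overflows and returns NA. Only the Medium total (937581572) fits. Converting the column to numeric first fixes it: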

example$volume <- as.numeric(example$volume)

aggregate(volume ~ size, data=example, FUN=sum)

         size      volume
1 Extra Large  3609485056
2       Large 11435467097
3      Medium   937581572
4       Small  3869267348
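
The overflow can be seen in isolation with base R alone. This is only a minimal sketch: the vector `x` is made up for illustration and is not taken from the asker's data.

```r
# Largest value an R integer can hold
.Machine$integer.max                # 2147483647

# Hypothetical integer vector whose sum is one past that limit
x <- c(.Machine$integer.max, 1L)

# sum() on integers returns NA (with an "integer overflow" warning)
int_sum <- suppressWarnings(sum(x))
is.na(int_sum)                      # TRUE

# Converting to double first avoids the overflow
dbl_sum <- sum(as.numeric(x))
dbl_sum                             # 2147483648
```

Note that the NA here does not come from the data. It is produced by sum() itself, which is why no NA shows up when inspecting the column.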

For more details, see:

What is integer overflow in R and how can it happen?

Answer 1: (score: 1)

The first thing to note is that when I run your example, I get:

example <- read.csv("https://pastebin.com/raw/i4dYCUQX", header=TRUE, sep=",")
# dplyr
example %>% group_by(size) %>% summarize_at(vars(volume), funs(sum))
#> Warning in summarise_impl(.data, dots): integer overflow - use
#> sum(as.numeric(.))

#> Warning in summarise_impl(.data, dots): integer overflow - use
#> sum(as.numeric(.))

#> Warning in summarise_impl(.data, dots): integer overflow - use
#> sum(as.numeric(.))
#> # A tibble: 4 × 2
#>          size    volume
#>        <fctr>     <int>
#> 1 Extra Large        NA
#> 2       Large        NA
#> 3      Medium 937581572
#> 4       Small        NA

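The warnings state plainly that your sums are overflowing the integer type. Following the suggestion in the warning message, we can convert the integers to numeric and then sum: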


example <- read.csv("https://pastebin.com/raw/i4dYCUQX", header=TRUE, sep=",")
# dplyr
example %>% group_by(size) %>% summarize_at(vars(volume), funs(sum(as.numeric(.))))
#> # A tibble: 4 × 2
#>          size      volume
#>        <fctr>       <dbl>
#> 1 Extra Large  3609485056
#> 2       Large 11435467097
#> 3      Medium   937581572
#> 4       Small  3869267348

Here funs(sum) has been replaced with funs(sum(as.numeric(.))). It does the same thing, running sum on each group, but converts the values to numeric first.
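
The behaviour can also be reproduced without downloading the data. The sketch below uses base R only. The data frame `toy` and its values are hypothetical, chosen so that one group overflows. It also suggests why colSums appeared to work in the question: colSums returns a double even for integer input, so it never goes through integer arithmetic.

```r
# Hypothetical toy data (not the asker's dataset): integer volumes in two groups
toy <- data.frame(
  size   = c("Small", "Small", "Medium", "Medium"),
  volume = c(2000000000L, 2000000000L, 10L, 20L)
)

# Integer sums: the "Small" total exceeds .Machine$integer.max, so it becomes NA
int_totals <- suppressWarnings(aggregate(volume ~ size, data = toy, FUN = sum))

# Converting inside the summary function leaves the column itself unchanged
dbl_totals <- aggregate(volume ~ size, data = toy,
                        FUN = function(v) sum(as.numeric(v)))

# colSums() returns double even for integer input, so it does not overflow here
small_total <- colSums(toy[toy$size == "Small", "volume", drop = FALSE])
typeof(small_total)  # "double"
```

Converting inside the summary function, as in the dplyr version above, keeps the stored column as integer. Converting the column once up front, as in the other answer, is the simpler choice if every later computation should work in double.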