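I have a dataset of about 3,000 rows. The data is available at https://pastebin.com/i4dYCUQX
The problem: the output contains NA, yet the data itself does not seem to contain any NA values. Here is what happens when I try to sum a column within each category, using either dplyr or aggregate: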
example <- read.csv("https://pastebin.com/raw/i4dYCUQX", header=TRUE, sep=",")
example
# dplyr
library(dplyr)
example %>% group_by(size) %>% summarize_at(vars(volume), funs(sum))
Out:
# A tibble: 4 x 2
size volume
<fctr> <int>
1 Extra Large NA
2 Large NA
3 Medium 937581572
4 Small NA
# aggregate
aggregate(volume ~ size, data=example, FUN=sum)
Out:
size volume
1 Extra Large NA
2 Large NA
3 Medium 937581572
4 Small NA
Computing the total with colSums, on the other hand, seems to work:
# Colsums
small <- example %>% filter(size == "Small")
colSums(small["volume"], na.rm = FALSE, dims = 1)
Out:
volume
3869267348
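Does anyone have an idea what the problem might be?
Answer 0 (score: 1)
This happens because the values are stored as integer rather than numeric (double). The group totals exceed the largest value a 32-bit R integer can hold (2,147,483,647), so sum() returns NA. Medium is the only group whose total stays below that limit. Converting the column to numeric first avoids the overflow: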
example$volume <- as.numeric(example$volume)
aggregate(volume ~ size, data=example, FUN=sum)
size volume
1 Extra Large 3609485056
2 Large 11435467097
3 Medium 937581572
4 Small 3869267348
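To see the effect in isolation, here is a minimal sketch on a made-up data frame. The `toy` data and its values are hypothetical, chosen only so that one group's total exceeds `.Machine$integer.max`; they are not taken from the pastebin data.

```r
# Largest value an R integer can hold (2147483647 on current platforms)
.Machine$integer.max

# Hypothetical data: the "Large" total (4e9) does not fit in an integer
toy <- data.frame(
  size   = c("Large", "Large", "Small"),
  volume = c(2000000000L, 2000000000L, 5L)
)

# Summing the integer column overflows for "Large" and yields NA
# (warnings suppressed here; R normally reports "integer overflow")
int_result <- suppressWarnings(aggregate(volume ~ size, data = toy, FUN = sum))

# After converting to numeric (double), the same call gives the real totals
toy$volume <- as.numeric(toy$volume)
dbl_result <- aggregate(volume ~ size, data = toy, FUN = sum)

int_result
dbl_result
```

The "Small" group sums correctly in both cases because its total stays within the integer range, which mirrors what happens to Medium in the question.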
Answer 1 (score: 1)
The first thing to note is that when I run your example, I get:
example <- read.csv("https://pastebin.com/raw/i4dYCUQX", header=TRUE, sep=",")
# dplyr
example %>% group_by(size) %>% summarize_at(vars(volume), funs(sum))
#> Warning in summarise_impl(.data, dots): integer overflow - use
#> sum(as.numeric(.))
#> Warning in summarise_impl(.data, dots): integer overflow - use
#> sum(as.numeric(.))
#> Warning in summarise_impl(.data, dots): integer overflow - use
#> sum(as.numeric(.))
#> # A tibble: 4 × 2
#> size volume
#> <fctr> <int>
#> 1 Extra Large NA
#> 2 Large NA
#> 3 Medium 937581572
#> 4 Small NA
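The warnings clearly show that your sums are overflowing the integer type. If we follow the suggestion in the warning message, we can convert the integers to numeric and then sum: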
example <- read.csv("https://pastebin.com/raw/i4dYCUQX", header=TRUE, sep=",")
# dplyr
example %>% group_by(size) %>% summarize_at(vars(volume), funs(sum(as.numeric(.))))
#> # A tibble: 4 × 2
#> size volume
#> <fctr> <dbl>
#> 1 Extra Large 3609485056
#> 2 Large 11435467097
#> 3 Medium 937581572
#> 4 Small 3869267348
Here funs(sum) has been replaced with funs(sum(as.numeric(.))). It does the same thing, applying sum to each group, but converts the values to numeric first.
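This probably also explains why colSums appeared to work in the question: colSums returns a double vector even when the column is integer, so the total never has to fit in an integer. A small sketch of that difference, using a hypothetical `small` data frame rather than the real data:

```r
# Hypothetical integer column whose total exceeds the integer range
small <- data.frame(volume = c(2000000000L, 2000000000L))

typeof(small$volume)                    # the column itself is integer

# colSums() produces a double result, so no overflow occurs
col_total <- colSums(small["volume"])
typeof(col_total)

# sum() on the integer column keeps the integer type and overflows to NA
int_total <- suppressWarnings(sum(small$volume))

# Converting first gives the same total as colSums()
dbl_total <- sum(as.numeric(small$volume))

col_total
int_total
dbl_total
```

In current dplyr versions, where funs() is deprecated, the same conversion can be written as summarise(volume = sum(as.numeric(volume))) after group_by(size).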