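I have a large dataset in R that I am wrangling with data.table. I want to aggregate some data and return, for each row, the ratio of each value to the row total.
I have managed to get most of the way there with dcast, but I can't figure out the last step.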
library(data.table)
tab <- "year qtr sales value
2016 1 A 50
2016 2 A 70
2016 3 A 90
2016 4 A 100
2017 1 A 80
2017 2 A 70
2017 3 A 80
2017 4 A 110
2016 1 B 33
2016 2 B 90
2016 3 B 120
2016 4 B 60
2017 1 B 120
2017 2 B 10
2017 3 B 88
2017 4 B 99
"
dt <- fread(tab)
dcast(dt, sales ~ year, fun.agg = function(x) sum(x), value.var = 'value')
sales 2016 2017
1: A 310 340
2: B 303 317
What I really want is the row ratios (i.e. 310 / (310 + 340), etc.):
sales 2016 2017
1: A 0.48 0.52
2: B 0.49 0.51
How can I do this?
Answer 0 (score: 2)
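Just divide by rowSums (thanks to @Ronak Shah for pointing out that apply isn't needed).
dt2[, -1] / rowSums(dt2[, -1])
#         2016      2017
# 1: 0.4769231 0.5230769
# 2: 0.4887097 0.5112903
Then round the result and cbind it back onto the sales column. No transpose is needed here, because the division already keeps one row per sales group.
dt2 <- cbind(dt2[, 1], round(dt2[, -1] / rowSums(dt2[, -1]), 2))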
dt2
# sales 2016 2017
# 1: A 0.48 0.52
# 2: B 0.49 0.51
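As an alternative to dividing by rowSums by hand, base R's prop.table with margin = 1 computes row-wise proportions of a matrix. The following is only a minimal sketch under the assumption that the wide table has the label in its first column and numeric columns after it; the names wide, shares and res are illustrative and not part of the original answer.

```r
library(data.table)

# Same sample data as in the question
dt <- data.table(
  year  = rep(rep(c(2016L, 2017L), each = 4), times = 2),
  qtr   = rep(1:4, times = 4),
  sales = rep(c("A", "B"), each = 8),
  value = c(50, 70, 90, 100, 80, 70, 80, 110,
            33, 90, 120, 60, 120, 10, 88, 99)
)

# Wide table of yearly totals per sales group
wide <- dcast(dt, sales ~ year, fun.aggregate = sum, value.var = "value")

# Row-wise proportions: margin = 1 divides each row by its own total
shares <- prop.table(as.matrix(wide[, -1]), margin = 1)

# Re-attach the sales labels and round for display
res <- cbind(wide[, "sales"], as.data.table(round(shares, 2)))
res
```

Because shares is an ordinary numeric matrix, rowSums(shares) can be used as a quick check that every row adds up to 1.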
Probably the best approach here is to keep it concise in data.table and do it in one step, as @chinsoon12 pointed out in the comments.
dt2 <- dcast(dt[, x := round(value / sum(value), 2), by=.(sales)], sales ~ year, sum, value.var='x')
dt2
# sales 2016 2017
# 1: A 0.48 0.52
# 2: B 0.49 0.51
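Two side effects of the one-liner are worth knowing: `:=` adds the column x to dt by reference, and each quarterly share is rounded before the shares are summed, so rounding error can accumulate on other data. A possible variant, sketched here with the illustrative names totals, total, share and res (none of them from the original answer), aggregates first and rounds only once:

```r
library(data.table)

# Same sample data as in the question
dt <- data.table(
  year  = rep(rep(c(2016L, 2017L), each = 4), times = 2),
  qtr   = rep(1:4, times = 4),
  sales = rep(c("A", "B"), each = 8),
  value = c(50, 70, 90, 100, 80, 70, 80, 110,
            33, 90, 120, 60, 120, 10, 88, 99)
)

# Step 1: yearly totals per sales group, kept in long format (dt is not modified)
totals <- dt[, .(total = sum(value)), by = .(sales, year)]

# Step 2: share of each year within its sales group, rounded once
totals[, share := round(total / sum(total), 2), by = sales]

# Step 3: reshape to wide; every sales/year cell holds exactly one value,
# so no aggregation function is needed
res <- dcast(totals, sales ~ year, value.var = "share")
res
```

On the sample data this gives the same rounded shares as the one-liner, while leaving dt untouched.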
Data
dt <- structure(list(year = c(2016L, 2016L, 2016L, 2016L, 2017L, 2017L,
2017L, 2017L, 2016L, 2016L, 2016L, 2016L, 2017L, 2017L, 2017L,
2017L), qtr = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L,
1L, 2L, 3L, 4L), sales = c("A", "A", "A", "A", "A", "A", "A",
"A", "B", "B", "B", "B", "B", "B", "B", "B"), value = c(50L,
70L, 90L, 100L, 80L, 70L, 80L, 110L, 33L, 90L, 120L, 60L, 120L,
10L, 88L, 99L)), row.names = c(NA, -16L), class = c("data.table",
"data.frame"))
dt2 <- dcast(dt, sales ~ year, fun.agg = function(x) sum(x), value.var = 'value')
Answer 1 (score: 2)
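Another direct data.table solution. Note that the ratio has to be computed within each sales group (by = .(sales)) to get the row ratios asked for; grouping by year instead would give each group's share of the yearly total.
dt[, .(tmp = sum(value)), by = .(year, sales)
][, .(value = tmp / sum(tmp), year), by = .(sales)
][, dcast(.SD, sales ~ year, value.var = "value")]
#    sales      2016      2017
# 1:     A 0.4769231 0.5230769
# 2:     B 0.4887097 0.5112903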
Answer 2 (score: 1)
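Using the tidyverse, we can group_by sales and year and sum value within each group. summarise drops the last grouping level (year), so the following mutate divides each yearly total by the total for its sales group. Finally, spread converts the result to wide format.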
library(tidyverse)
dt %>%
group_by(sales, year) %>%
summarise(value = sum(value)) %>%
mutate(value = value/sum(value)) %>%
spread(year, value)
# sales `2016` `2017`
# <chr> <dbl> <dbl>
#1 A 0.477 0.523
#2 B 0.489 0.511
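In tidyr 1.0.0 and later, spread is superseded by pivot_wider. The same pipeline could be written as follows; this is only a sketch, assuming dplyr 1.0.0 or later for the .groups argument, with the sample data rebuilt as a tibble and res as an illustrative name.

```r
library(dplyr)
library(tidyr)

# Same sample data as in the question, as a tibble
dt <- tibble(
  year  = rep(rep(c(2016L, 2017L), each = 4), times = 2),
  qtr   = rep(1:4, times = 4),
  sales = rep(c("A", "B"), each = 8),
  value = c(50, 70, 90, 100, 80, 70, 80, 110,
            33, 90, 120, 60, 120, 10, 88, 99)
)

res <- dt %>%
  group_by(sales, year) %>%
  # Keep the grouping by sales explicitly after summarising
  summarise(value = sum(value), .groups = "drop_last") %>%
  # Share of each year within its sales group
  mutate(value = value / sum(value)) %>%
  ungroup() %>%
  # One column per year
  pivot_wider(names_from = year, values_from = value)
res
```

Setting .groups = "drop_last" makes explicit the behaviour the original pipeline relies on implicitly, namely that the shares are computed per sales group.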