Suppose I have the following data.table:
library(data.table)

dt <- data.table(id = c(rep(1, 5), rep(2, 4)),
                 sex = c(rep("H", 5), rep("F", 4)),
                 fruit = c("apple", "tomato", "apple", "apple", "orange", "apple", "apple", "tomato", "tomato"),
                 key = "id")
id sex fruit
1: 1 H apple
2: 1 H tomato
3: 1 H apple
4: 1 H apple
5: 1 H orange
6: 2 F apple
7: 2 F apple
8: 2 F tomato
9: 2 F tomato
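Each row records the fact that a person (identified by id and sex) ate a fruit. I want to count how many times each fruit was eaten by each sex. I can do that with: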
dt[ , .N, by = c("fruit", "sex")]
which gives:
fruit sex N
1: apple H 3
2: tomato H 1
3: orange H 1
4: apple F 2
5: tomato F 2
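The problem is that, done this way, I lose the count for orange with sex == "F", because that count is 0. Is there a way to do this aggregated count without losing the zero-count combinations?
To be perfectly clear, the desired result is: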
fruit sex N
1: apple H 3
2: tomato H 1
3: orange H 1
4: apple F 2
5: tomato F 2
6: orange F 0
Thanks a lot!
Answer 0 (score: 9)
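The most straightforward approach seems to be to supply every combination of the categories explicitly as a data.table passed to i=, and to set by=.EACHI so that the count is evaluated once per combination: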
setkey(dt, sex, fruit)
dt[CJ(sex, fruit, unique = TRUE), .N, by = .EACHI]
# sex fruit N
# 1: F apple 2
# 2: F orange 0
# 3: F tomato 2
# 4: H apple 3
# 5: H orange 1
# 6: H tomato 1
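The same idea can be sketched without re-keying dt, by naming the join columns through on=. This is only an illustrative variant, not part of the original answer; the names combos and res are made up for the example, and it assumes a data.table version that supports on=.

```r
library(data.table)

dt <- data.table(id = c(rep(1, 5), rep(2, 4)),
                 sex = c(rep("H", 5), rep("F", 4)),
                 fruit = c("apple", "tomato", "apple", "apple", "orange",
                           "apple", "apple", "tomato", "tomato"))

# All sex/fruit combinations present in the data (hypothetical helper table)
combos <- CJ(sex = dt$sex, fruit = dt$fruit, unique = TRUE)

# Count matching rows of dt for each row of combos;
# combinations with no match get N = 0
res <- dt[combos, .N, on = .(sex, fruit), by = .EACHI]
print(res)
```

Because the join columns are given in on=, the existing key of dt (if any) is left untouched.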
Answer 1 (score: 7)
One way is to convert sex or id to a factor (id seems to be redundant here?):
dt[, sex := factor(sex)]
dt[, .(sex=levels(sex), N=c(table(sex))), by=fruit]
# fruit sex N
# 1: apple F 2
# 2: apple H 3
# 3: tomato F 2
# 4: tomato H 1
# 5: orange F 0
# 6: orange H 1
Or you can convert fruit to a factor and group by sex:
dt[, fruit := factor(fruit)]
dt[, .(fruit = levels(fruit), N=c(table(fruit))),by=sex]
# sex fruit N
# 1: H apple 3
# 2: H orange 1
# 3: H tomato 1
# 4: F apple 2
# 5: F orange 0
# 6: F tomato 2
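Both variants above rely on table() reporting a count for every factor level, including levels with no rows. As a rough illustration of that behaviour (not something the answer itself proposes), a two-way table() over both columns can be converted directly to a data.table; the name tab is made up for the example.

```r
library(data.table)

dt <- data.table(sex = c(rep("H", 5), rep("F", 4)),
                 fruit = c("apple", "tomato", "apple", "apple", "orange",
                           "apple", "apple", "tomato", "tomato"))

# table() builds the full sex x fruit cross-tabulation, zeros included;
# as.data.table() turns it into long format with a count column N
tab <- as.data.table(table(sex = dt$sex, fruit = dt$fruit))
print(tab)
```

Note that this materialises the complete cross-tabulation, so it shares the scaling concern about table() raised in the next paragraph.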
But I suspect that relying on table may not be a good idea if your data.table is large. In that case, using CJ from your earlier question is probably the way to go. That is, aggregate first, then join.
out <- setkey(dt, sex, fruit)[, .N,
by="sex,fruit"][CJ(c("H","F"),
c("apple","tomato","orange")),
allow.cartesian=TRUE][is.na(N), N := 0L]
# sex fruit N
# 1: F apple 2
# 2: F orange 0
# 3: F tomato 2
# 4: H apple 3
# 5: H orange 1
# 6: H tomato 1
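The aggregate-then-join steps can also be written out one at a time, deriving the combinations from the data instead of hard-coding them. This is a minimal sketch under the assumption that on= is available in the installed data.table version; counts, combos and out are illustrative names.

```r
library(data.table)

dt <- data.table(sex = c(rep("H", 5), rep("F", 4)),
                 fruit = c("apple", "tomato", "apple", "apple", "orange",
                           "apple", "apple", "tomato", "tomato"))

# Step 1: aggregate, which yields only the combinations that occur
counts <- dt[, .N, by = .(sex, fruit)]

# Step 2: build every sex/fruit combination from the observed values
combos <- CJ(sex = unique(dt$sex), fruit = unique(dt$fruit))

# Step 3: join the counts onto the full grid; missing combinations get NA
out <- counts[combos, on = .(sex, fruit)]

# Step 4: replace the NA counts with 0
out[is.na(N), N := 0L]
print(out)
```

Aggregating before the join keeps the join small: it touches one row per observed combination rather than one row per observation.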