I have the following data.table in R:
library(data.table)
dt = data.table(ID = c("person1", "person1", "person1", "person2", "person2", "person2", "person2", "person2", ...), category = c("red", "red", "blue", "red", "red", "blue", "green", "green", ...))
dt
ID category
person1 red
person1 red
person1 blue
person2 red
person2 red
person2 blue
person2 green
person2 green
person3 blue
....
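I am looking for a way to create a "frequency" count of the categorical values red, blue, and green for each unique ID, and then spread these into columns that record the count for each unique ID. The resulting data.table would look like this: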
dt
ID red blue green
person1 2 1 0
person2 2 1 2
...
I mistakenly thought the right way to start this with data.table was to compute table() by group, e.g.
dt[, counts :=table(category), by=ID]
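But this seems to compute the total number of category values per ID group. It also does not solve my problem of "widening" the data.table.
What is the correct way to do this?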
Answer 0: (score: 1)
Like this?
library(data.table)
library(dplyr)
dt[, .N, by = .(ID, category)] %>% dcast(ID ~ category)
If you want to add these columns to the original data.table:
counts <- dt[, .N, by = .(ID, category)] %>% dcast(ID ~ category)
counts[is.na(counts)] <- 0
output <- merge(dt, counts, by = "ID")
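The same count-then-widen idea can probably be done with data.table alone, without the dplyr pipe or the separate NA replacement step. This is only a sketch: it assumes the eight rows shown in full in the question (the elided rows are left out), and the names long and wide are just illustrative.

```r
library(data.table)

# Sample data: only the eight rows shown in full in the question
dt <- data.table(
  ID = c("person1", "person1", "person1",
         "person2", "person2", "person2", "person2", "person2"),
  category = c("red", "red", "blue", "red", "red", "blue", "green", "green")
)

# Long format: one row per (ID, category) pair with its count N
long <- dt[, .N, by = .(ID, category)]

# Wide format: one column per category; fill = 0L puts 0 where a pair never occurs
wide <- dcast(long, ID ~ category, value.var = "N", fill = 0L)
wide
```

Passing value.var = "N" explicitly avoids the message dcast prints when it has to guess the value column, and merge(dt, wide, by = "ID") should still attach the counts to the original rows if that is what you need.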
Answer 1: (score: 1)
This does it in an imperative style; there may be a cleaner, more functional way.
library(data.table)
library(dplyr)  # load dplyr for %>% and filter()
dt = data.table(ID = c("person1", "person1", "person1", "person2", "person2", "person2", "person2", "person2"),
                category = c("red", "red", "blue", "red", "red", "blue", "green", "green"))
ids <- unique(dt$ID)
categories <- unique(dt$category)
# One row per ID, one column per category
counts <- matrix(nrow=length(ids), ncol=length(categories))
rownames(counts) <- ids
colnames(counts) <- categories
for (i in seq_along(ids)) {
  for (j in seq_along(categories)) {
    # Count the rows matching this ID/category pair
    count <- dt %>%
      filter(ID == ids[i], category == categories[j]) %>%
      nrow()
    counts[i, j] <- count
  }
}
Then:
>counts
## red blue green
##person1 2 1 0
##person2 2 1 2
Answer 2: (score: 1)
You can use the reshape2 library to do it in one line.
library(reshape2)
dcast(data=dt,
ID ~ category,
fun.aggregate = length,
value.var = "category")
ID blue green red
1 person1 1 0 2
2 person2 1 2 2
Also, if you only need a simple two-way table, you can use the built-in R table function.
table(dt$ID,dt$category)
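If you want the two-way table back as a data.table with the columns in the order red, blue, green, something like the following might work. It is a sketch under assumptions: the sample data is just the eight rows shown in full in the question, and cats, tab, and wide are illustrative names.

```r
library(data.table)

# Sample data: only the eight rows shown in full in the question
dt <- data.table(
  ID = c("person1", "person1", "person1",
         "person2", "person2", "person2", "person2", "person2"),
  category = c("red", "red", "blue", "red", "red", "blue", "green", "green")
)

# Fix the level order so the columns come out as red, blue, green
cats <- factor(dt$category, levels = c("red", "blue", "green"))
tab <- table(dt$ID, cats)

# as.data.frame.matrix() keeps the two-way layout
# (plain as.data.frame() on a table returns long format instead)
wide <- as.data.table(as.data.frame.matrix(tab), keep.rownames = TRUE)
setnames(wide, "rn", "ID")  # keep.rownames = TRUE stores the row names in "rn"
wide
```

Using a factor with explicit levels also keeps a column for a category that happens to have no rows at all, which table() on a plain character vector would drop.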