How do I calculate frequencies/tables of a categorical variable by group with R data.table?

Asked: 2017-06-05 16:45:14

Tags: r dataframe data.table frequency

I have the following data.table in R:

library(data.table)
dt = data.table(ID = c("person1", "person1", "person1", "person2", "person2", "person2", "person2", "person2", ...), category = c("red", "red", "blue", "red", "red", "blue", "green", "green", ...))

dt
ID         category
person1    red
person1    red
person1    blue
person2    red
person2    red
person2    blue
person2    green
person2    green
person3    blue
....

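I am trying to work out how to compute a "frequency" of the categorical values red, blue and green for each unique ID, and then spread those values into columns that record the count for each unique ID. The resulting data.table would look like this: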

dt
ID        red    blue    green
person1   2      1       0
person2   2      1       2    
...

I mistakenly thought the right way to start with data.table was to compute table() by group, e.g.

dt[, counts :=table(category), by=ID]

But this appears to count the total number of categorical values per group ID. It also doesn't solve my problem of "widening" the data.table.

What is the correct way to do this?

3 answers:

Answer 0: (score: 1)

Something like this?

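library(data.table)
library(dplyr)  # only needed for the %>% pipe
# Count rows per (ID, category), then spread the categories into columns
dt[, .N, by = .(ID, category)] %>% dcast(ID ~ category, value.var = "N")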

If you want to add these columns to the original data.table:

counts <- dt[, .N, by = .(ID, category)] %>% dcast(ID ~ category, value.var = "N")
counts[is.na(counts)] <- 0  # combinations that never occur come back as NA
output <- merge(dt, counts, by = "ID")
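
For anyone who wants to run the steps above end to end, here is a minimal sketch that avoids the pipe and fills missing combinations during the reshape. It assumes only the eight rows shown in the question (the elided rows are left out), and the name long_counts is just an illustrative choice.

```r
library(data.table)

# Sample data copied from the question; rows hidden behind "..." are omitted
dt <- data.table(
  ID = c("person1", "person1", "person1",
         "person2", "person2", "person2", "person2", "person2"),
  category = c("red", "red", "blue", "red", "red", "blue", "green", "green")
)

# Step 1: long-format counts, one row per (ID, category) pair that occurs
long_counts <- dt[, .N, by = .(ID, category)]

# Step 2: spread categories into columns; fill = 0L covers pairs that never occur
counts <- dcast(long_counts, ID ~ category, value.var = "N", fill = 0L)

# Step 3 (optional): attach the per-ID counts to every original row
output <- merge(dt, counts, by = "ID")

print(counts)
```

Passing fill = 0L to dcast should make the separate is.na() replacement unnecessary, though either route ought to give the same counts.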

Answer 1: (score: 1)

This does it in an imperative style; there is probably a cleaner, more functional way.

library(data.table)
library(dplyr)  # provides %>% and filter(); dplyr verbs also work on a data.table
dt = data.table(ID = c("person1", "person1", "person1", "person2", "person2", "person2", "person2", "person2"), 
                category = c("red", "red", "blue", "red", "red", "blue", "green", "green"))


ids <- unique(dt$ID)
categories <- unique(dt$category)
counts <- matrix(nrow=length(ids), ncol=length(categories))
rownames(counts) <- ids
colnames(counts) <- categories

for (i in seq_along(ids)) {
  for (j in seq_along(categories)) {
    count <- dt %>%
      filter(ID == ids[i], category == categories[j]) %>%
      nrow()

    counts[i, j] <- count
  }
}

Then:

>counts
##         red blue green
##person1   2    1     0
##person2   2    1     2
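
As one possible take on the "more functional" version mentioned above, the nested loops could be replaced with sapply/vapply from base R. This is only a sketch under the same sample data; the argument names cat_value and id are illustrative.

```r
library(data.table)

dt <- data.table(
  ID = c("person1", "person1", "person1",
         "person2", "person2", "person2", "person2", "person2"),
  category = c("red", "red", "blue", "red", "red", "blue", "green", "green")
)

ids <- unique(dt$ID)
categories <- unique(dt$category)

# Outer sapply builds one column per category; inner vapply counts the
# matching rows for each ID. Names are taken from the character inputs,
# so the result is a matrix with IDs as row names and categories as columns.
counts <- sapply(categories, function(cat_value) {
  vapply(ids, function(id) sum(dt$ID == id & dt$category == cat_value), integer(1))
})

print(counts)
```

This still scans the data once per cell, so it is no faster than the loop; it only removes the manual index bookkeeping.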

Answer 2: (score: 1)

You can use the reshape2 library to do this in one line.

library(reshape2)
dcast(data=dt,
      ID ~ category,
      fun.aggregate = length,
      value.var = "category")

       ID blue green red
1 person1    1     0   2
2 person2    1     2   2

Also, if you only need a simple two-way table, you can use R's built-in table function.

table(dt$ID,dt$category)
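
If the two-way table then needs to become a wide data.table with an ID column, as in the question, one option is to go through as.data.frame.matrix. This is a sketch on the sample rows from the question; the names tab, counts_df and wide are illustrative.

```r
library(data.table)

dt <- data.table(
  ID = c("person1", "person1", "person1",
         "person2", "person2", "person2", "person2", "person2"),
  category = c("red", "red", "blue", "red", "red", "blue", "green", "green")
)

# Two-way contingency table: rows are IDs, columns are categories
tab <- table(dt$ID, dt$category)

# as.data.frame.matrix keeps the wide layout (plain as.data.frame would
# return the long form), with the IDs stored as row names
counts_df <- as.data.frame.matrix(tab)

# Move the row names into a regular ID column
wide <- data.table(ID = rownames(counts_df), counts_df)

print(wide)
```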