R - data.table: calculating the frequency of categorical variables and showing each variable as

Time: 2017-05-17 18:37:40

Tags: r data.table

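I created a dummy data table named DT. For each ID, I am trying to compute the sum of Capacity (numeric) and the frequency of each Code and State value (categorical). In the final result, each unique ID should appear on one row with its total capacity, the counts of A, B, C, ..., and the counts of each state. The column names would therefore be ID, total.Cap, A, B, C, ..., AZ, CA, ...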

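library(data.table)
DT <- data.table(ID = rep(1:500, 100),
            Capacity = sample(1:1000, size = 50000, replace = TRUE),
            Code = sample(LETTERS[1:26], 50000, replace = TRUE),
            # note: rep(..., 50000) yields 300,000 values, not 50,000;
            # rep(..., length.out = 50000) would match the other columns
            State = rep(c("AZ","CA","PA","NY","WA","SD"), 50000))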


The result should look like the table below:
ID total.Cap  A   B   C  ...   AZ  CA ...
1   28123    10   25  70 ...   29  ...
2   32182    20   42  50  ...  30  ...
3

I tried ddply, melt, and dcast, but the result did not come out the way I wanted. Can anyone give me some hints on how to build this table? Thanks!

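2 answers:

Answer 0: (score: 1)

You can build the totals, the state counts, and the code counts with three separate data.table statements and then join them. For the states and codes, you can use dcast to turn each state/code into its own column holding the count for that state/code.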

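library(data.table)

# total capacity per ID
totals <- DT[, list(total.Cap = sum(Capacity)), by = "ID"]
# one column per state / code, holding the number of rows for each ID
states <- dcast(DT, ID ~ State, fun.aggregate = length, value.var = "State")
codes <- dcast(DT, ID ~ Code, fun.aggregate = length, value.var = "Code")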

You can then join the three tables together:

result <- totals[states, on = "ID"][codes, on = "ID"]

This results in a table like the following:

      ID total.Cap  AZ  CA  NY  PA  SD  WA  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U
  1:   1    287526 200   0   0 200   0 200 12 18 24 42 12 30 30 18  6 36 24  6 18 24 30 24  6 24 36 18 30
  2:   2    293838   0 200 200   0 200   0 18 24 42 30 30 12 24  6 24 12 48 42 18 18 42 24 24 24 12 18 24
  3:   3    279450 200   0   0 200   0 200 24 18 24  6 12 12 18 12 12 30 24 18 54 30  6 42 18 30 24 24 18
  4:   4    298200   0 200 200   0 200   0 30 30 36 30 36 24 24 18 24 18 30 30 30 24  6 30 18  6 18 18 18
  5:   5    294084 200   0   0 200   0 200 18  6 24 12 42 12 18 42 18 18 18 18 24 24 30 18 30 24  6 30 24

Note that if you have many columns like State and Code, you can melt them first and do all of this in one go:

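# replace State and Code with the categorical variables you want
melted <- melt(DT, id.vars = "ID", measure.vars = c("State", "Code"))
# count how often each State/Code value occurs per ID
state_codes <- dcast(melted, ID ~ value, fun.aggregate = length, value.var = "value")
result <- totals[state_codes, on = "ID"]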

Note that you still need to join the totals, and that this does not preserve a column order such as "states then codes" or vice versa.
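If the column order matters, one possible way to restore it after the melt/dcast approach is setcolorder. The sketch below uses a tiny made-up data set (the values are hypothetical, chosen only for illustration) and assumes that no State value is identical to a Code value, since both end up as column names in the same table.

```r
library(data.table)

# Small hypothetical data set, for illustration only
DT <- data.table(
  ID       = c(1, 1, 2, 2, 2),
  Capacity = c(10, 20, 30, 40, 50),
  Code     = c("A", "B", "A", "A", "C"),
  State    = c("CA", "AZ", "CA", "CA", "NY")
)

totals <- DT[, list(total.Cap = sum(Capacity)), by = "ID"]

# Count every State and Code value per ID in one pass
melted <- melt(DT, id.vars = "ID", measure.vars = c("State", "Code"))
state_codes <- dcast(melted, ID ~ value,
                     fun.aggregate = length, value.var = "value")

result <- totals[state_codes, on = "ID"]

# Reorder by reference: ID and total first, then states, then codes
state_cols <- sort(unique(DT$State))
code_cols  <- sort(unique(DT$Code))
setcolorder(result, c("ID", "total.Cap", state_cols, code_cols))
result
```

Swapping state_cols and code_cols in the setcolorder call gives the "codes then states" layout instead.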

Answer 1: (score: 0)

This creates the total.Cap, Code, and State summary pieces in three separate data tables and then merges them by ID:

# Storing intermediate pieces
  total_cap <- DT[, j = list(total.Cap = sum(Capacity)), by = ID]
  code <- dcast(DT[, .N, by = c("ID", "Code")], ID ~ Code, fill = 0)
  state <- dcast(DT[, .N, by = c("ID", "State")], ID ~ State, fill = 0)

  mytable <- merge(total_cap, code, by = "ID")
  mytable <- merge(mytable, state, by = "ID")
  mytable

# As a one-liner
  mytable <- merge(
               merge(DT[, j = list(total.Cap = sum(Capacity)), by = ID],
                     dcast(DT[, .N, by = c("ID", "Code")], ID ~ Code, fill = 0),
                     by = "ID"),
               dcast(DT[, .N, by = c("ID", "State")], ID ~ State, fill = 0),
               by = "ID")
  mytable
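
With more than two categorical columns, the nested merge calls get unwieldy. A minimal sketch of one way to generalize the same idea is to build one count table per column and fold them together with Reduce. The helper count_wide, the vector cat_vars, and the sample data are hypothetical names and values introduced only for this illustration.

```r
library(data.table)

# Small hypothetical data set, for illustration only
DT <- data.table(
  ID       = c(1, 1, 2, 2, 2),
  Capacity = c(10, 20, 30, 40, 50),
  Code     = c("A", "B", "A", "A", "C"),
  State    = c("CA", "AZ", "CA", "CA", "NY")
)

# Categorical columns to count (assumed names; extend as needed)
cat_vars <- c("Code", "State")

# Hypothetical helper: per-ID counts of one column, spread to wide format
count_wide <- function(dt, var) {
  counts <- dt[, .N, by = c("ID", var)]
  dcast(counts, as.formula(paste("ID ~", var)), value.var = "N", fill = 0L)
}

# The totals table first, then one wide count table per categorical column
pieces <- c(
  list(DT[, list(total.Cap = sum(Capacity)), by = ID]),
  lapply(cat_vars, count_wide, dt = DT)
)

# Merge all pieces on ID, left to right
mytable <- Reduce(function(x, y) merge(x, y, by = "ID"), pieces)
mytable
```

The columns come out in the order of cat_vars, so listing "State" before "Code" puts the state counts first.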