How to summarize data by cluster

Time: 2014-11-24 06:53:16

Tags: r data.table

Suppose I have the following data:

library(data.table)
set.seed(200)
data <- data.table(income = runif(20, 1000, 8000), gender = sample(0:1, 20, TRUE),
                   asset = runif(20, 10000, 80000), education = sample(1:4, 20, TRUE),
                   cluster = sample(1:4, 20, TRUE))

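My data contains both continuous and categorical variables. I want to summarize the data by the cluster variable, as follows:

Continuous variables (income and asset): take the mean, so I applied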

data[,lapply(.SD, mean), by = cluster, .SDcols = c(1,3)]
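A minimal sketch of the same per-cluster mean, assuming you would rather select columns by name than by position so the call still works if the column order changes. The result name `cluster_means` is only illustrative, and the sample data is rebuilt so the block runs on its own.

```r
library(data.table)

set.seed(200)
data <- data.table(income = runif(20, 1000, 8000), gender = sample(0:1, 20, TRUE),
                   asset = runif(20, 10000, 80000), education = sample(1:4, 20, TRUE),
                   cluster = sample(1:4, 20, TRUE))

# .SDcols accepts column names as well as positions
cluster_means <- data[, lapply(.SD, mean), by = cluster,
                      .SDcols = c("income", "asset")]
cluster_means
```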

Categorical variables (gender and education): I used

table(data[,gender, by = cluster])/rowSums(table(data[,gender, by = cluster]))

table(data[,education, by = cluster])/rowSums(table(data[,education, by = cluster]))

I don't think my code is efficient.

Could you suggest a better way to handle this case?

2 Answers:

Answer 0: (score: 2)

I would do it like this:

data[, .N, by=.(gender, cluster)][, .(gender, ratio = N/sum(N)), by=cluster]
data[, .N, by=.(education, cluster)][, .(education, ratio = N/sum(N)), by=cluster]
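As a rough sketch, the same count-then-divide idea can be wrapped in a small function so it is not repeated for each variable. `prop_by_cluster` and `props` are hypothetical names, not part of data.table, and the sample data is rebuilt so the block runs on its own.

```r
library(data.table)

set.seed(200)
data <- data.table(income = runif(20, 1000, 8000), gender = sample(0:1, 20, TRUE),
                   asset = runif(20, 10000, 80000), education = sample(1:4, 20, TRUE),
                   cluster = sample(1:4, 20, TRUE))

# Hypothetical helper: share of each level of `col` within each cluster
prop_by_cluster <- function(dt, col) {
  # Count rows per (cluster, level) combination
  counts <- dt[, .N, by = c("cluster", col)]
  # Divide each count by its cluster total
  counts[, ratio := N / sum(N), by = "cluster"]
  setorderv(counts, c("cluster", col))
  counts[]
}

cat_cols <- c("gender", "education")
props <- setNames(lapply(cat_cols, function(v) prop_by_cluster(data, v)), cat_cols)
props
```

Each element of `props` is a long-format table with one row per cluster and level, and the `ratio` values within a cluster sum to 1.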

Answer 1: (score: 1)

You can use a for loop over the categorical variables:

res <- list()
for(i in c('gender', 'education')){
   res[[i]] <- prop.table(table(cbind(data[,'cluster', with=FALSE], 
                           data[,i, with=FALSE])), margin=1)
}

res

Or:

lapply(data[,c('gender','education'), with=FALSE], function(x)
         prop.table(table(cbind(data[,'cluster', with=FALSE],x)), margin=1))
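A possible simplification of the above, offered as a sketch rather than a tested replacement: `table()` also accepts the two vectors directly, so the `cbind` step can be skipped. The dimension labels `cluster` and `value` are my own choice, and the sample data is rebuilt so the block runs on its own.

```r
library(data.table)

set.seed(200)
data <- data.table(income = runif(20, 1000, 8000), gender = sample(0:1, 20, TRUE),
                   asset = runif(20, 10000, 80000), education = sample(1:4, 20, TRUE),
                   cluster = sample(1:4, 20, TRUE))

# Cross-tabulate cluster against each categorical column,
# then convert counts to row-wise (per-cluster) proportions
res <- lapply(data[, .(gender, education)], function(x)
  prop.table(table(cluster = data$cluster, value = x), margin = 1))
res
```

With `margin = 1` every row of each table sums to 1, which matches the division by `rowSums` in the question.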