Factor level order changes after aggregating with data.table

Time: 2014-01-22 03:39:01

Tags: r data.table aggregation

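I am aggregating a data.table with the function below, grp, and have run into a problem.

The level order of the factor variable fc_x is not preserved after aggregation. Is something wrong with my function, or is this expected behavior that has an explanation?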

grp <- function(x) {
  # share of observations falling into each level of x
  percentage <- as.numeric(table(x) / length(x))
  list(x = factor(levels(x)),
       percentage = percentage,
       label = paste0(round(percentage * 100), "%")
  )
}

library(data.table)

set.seed(123)
DT <- data.table(x = rnorm(100, 100, 50), fac = factor(letters[1:10]))
DT$fc_x <- cut(DT$x, breaks = c(0, 50, 100, 10e5), right = T,
            labels = c("0-50", "51-100", "+100"))

str(DT)
# Classes ‘data.table’ and 'data.frame':  100 obs. of  3 variables:
# $ x   : num  90.7 59.4 18 125.4 187.7 ...
# $ fac : Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
# $ fc_x: Factor w/ 3 levels "0-50","51-100",..: 2 2 1 3 3 3 3 3 1 1 ...

levels(DT$fc_x)
# [1] "0-50"   "51-100" "+100"

AGG <- DT[, grp(fc_x), by=fac]

levels(AGG$x)
# [1] "+100"   "0-50"   "51-100"

Edit

Changing "+100" to "1000" gives a similar result:

DT <- data.table(x = rnorm(100, 100, 50), fac = factor(letters[1:10]))
DT$fc_x <- cut(DT$x, breaks = c(0, 50, 100, 10e5), right = T,
               labels = c("0-50", "51-100", "1000"))

levels(DT$fc_x)
# [1] "0-50"   "51-100" "1000"

AGG <- DT[, grp(fc_x), by=fac]
levels(AGG$x)
# [1] "0-50"   "1000"   "51-100"

Using ordered = TRUE in the cut() call gives the same result:

DT <- data.table(x = rnorm(100, 100, 50), fac = factor(letters[1:10]))
DT$fc_x <- cut(DT$x, breaks = c(0, 50, 100, 10e5), right = T, ordered = T,
               labels = c("0-50", "51-100", "1000"))

levels(DT$fc_x)
# [1] "0-50"   "51-100" "1000"

AGG <- DT[, grp(fc_x), by=fac]
levels(AGG$x)
# [1] "0-50"   "1000"   "51-100"

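2 answers:

Answer 0 (score: 3)

I think the problem is that when you define x inside the function you do not supply labels, so factor() simply puts the levels in alphabetical order. I think you just need to add the labels in your function.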

DT$fc_x <- cut(DT$x, breaks = c(0, 50, 100, 10e5), right = T,
               labels = c("0-50", "51-100", "+100"))

factor(levels(DT$fc_x))
# [1] 0-50   51-100 +100
# Levels: 0-50 +100 51-100

factor(levels(DT$fc_x), labels = c("0-50", "51-100", "+100"))
# [1] 0-50   +100   51-100
# Levels: 0-50 51-100 +100


grp <- function(x) {
  percentage = as.numeric(table(x)/length(x))
  list(
       x = factor(levels(x), labels = levels(x)),
       percentage = percentage,
       label = paste0(round(percentage * 100), "%")
  )
}

DT <- data.table(x = rnorm(100, 100, 50), fac = factor(letters[1:10]))

DT$fc_x <- cut(DT$x, breaks = c(0, 50, 100, 10e5), right = T,
               labels = c("0-50", "51-100", "+100"))
AGG <- DT[, grp(fc_x), by=fac]
levels(AGG$x)
# [1] "0-50"   "51-100" "+100"
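
One caveat with the labels approach, as far as I can tell: factor() sorts the character values first and only then applies labels, so the level names come out in the intended order but can end up attached to the wrong values (the second printout above, where "51-100" is shown as "+100", hints at this). A minimal sketch of an alternative, assuming the goal is to keep both the order and the value-to-level mapping, is to pass levels = levels(x) instead. The name grp_levels and the -Inf/Inf breaks are my own choices for the illustration, not part of the original code.

```r
library(data.table)

# grp_levels is a hypothetical name: same idea as grp, but it fixes the
# level order with `levels =` rather than renaming sorted levels with `labels =`
grp_levels <- function(x) {
  tab <- table(x) / length(x)          # shares, in the order of levels(x)
  list(x = factor(levels(x), levels = levels(x)),
       percentage = as.numeric(tab),
       label = paste0(round(as.numeric(tab) * 100), "%"))
}

set.seed(123)
DT <- data.table(x = rnorm(100, 100, 50), fac = factor(letters[1:10]))
# -Inf/Inf breaks are an assumption so that no value falls outside the bins
DT[, fc_x := cut(x, breaks = c(-Inf, 50, 100, Inf),
                 labels = c("0-50", "51-100", "+100"))]

AGG <- DT[, grp_levels(fc_x), by = fac]
levels(AGG$x)
# [1] "0-50"   "51-100" "+100"
```

Because levels = only declares the allowed values and their order, each row of AGG still carries the percentage of the level it names.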

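Answer 1 (score: 0)

After using the modified version of grp on a real dataset, the levels were in the right order but no longer matched the actual values after aggregation.

I came up with this, which I believe is a simpler solution: carry the names over from the table() result. If I do not use as.numeric(table(...)), the names are kept.

Thanks for your help, Matt and Matthew. I will accept your answer because it was helpful.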

grp <- function(x) {
  percentage = data.frame(table(x)/length(x))
  list(x = factor(percentage[[1]]),
       percentage = percentage[[2]],
       label = paste0(round(percentage[[2]] * 100), "%")
  )
}
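
The same idea, keeping the names that table() attaches to its result, can also be sketched without the data.frame() step. This is only an illustration that assumes x is a factor; grp_named and the small test vector f are hypothetical and not part of the original answer.

```r
# grp_named is a hypothetical name: it builds the output factor from the
# names of the table, so values and shares stay aligned by construction
grp_named <- function(x) {
  tab <- table(x) / length(x)          # named shares; names follow levels(x)
  list(x = factor(names(tab), levels = names(tab)),
       percentage = as.numeric(tab),
       label = paste0(round(as.numeric(tab) * 100), "%"))
}

# made-up example data with a deliberately non-alphabetical level order
f <- factor(c("0-50", "+100", "+100", "51-100"),
            levels = c("0-50", "51-100", "+100"))
res <- grp_named(f)
levels(res$x)
# [1] "0-50"   "51-100" "+100"
res$percentage
# [1] 0.25 0.25 0.50
```

It should drop into DT[, grp_named(fc_x), by = fac] the same way as grp.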