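I am using the following function, grp, to aggregate with data.table
and have run into a problem.
The problem is that the levels of the factor variable fc_x
do not keep the same order after aggregation.
Is something wrong with my function, or is this "normal" behaviour that has an explanation?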
grp <- function(x) {
  percentage = as.numeric(table(x)/length(x))
  list(x = factor(levels(x)),
       percentage = percentage,
       label = paste0(round(as.numeric(table(x)/length(x)) * 100, 0), "%")
  )
}
library(data.table)
set.seed(123)
DT <- data.table(x = rnorm(100, 100, 50), fac = factor(letters[1:10]))
DT$fc_x <- cut(DT$x, breaks = c(0, 50, 100, 10e5), right = T,
labels = c("0-50", "51-100", "+100"))
str(DT)
# Classes ‘data.table’ and 'data.frame': 100 obs. of 3 variables:
# $ x : num 90.7 59.4 18 125.4 187.7 ...
# $ fac : Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
# $ fc_x: Factor w/ 3 levels "0-50","51-100",..: 2 2 1 3 3 3 3 3 1 1 ...
levels(DT$fc_x)
# [1] "0-50" "51-100" "+100"
AGG <- DT[, grp(fc_x), by=fac]
levels(AGG$x)
# [1] "+100" "0-50" "51-100"
Edit
Changing "+100" to "1000" gives a similar result:
DT <- data.table(x = rnorm(100, 100, 50), fac = factor(letters[1:10]))
DT$fc_x <- cut(DT$x, breaks = c(0, 50, 100, 10e5), right = T,
labels = c("0-50", "51-100", "1000"))
levels(DT$fc_x)
# [1] "0-50" "51-100" "1000"
AGG <- DT[, grp(fc_x), by=fac]
levels(AGG$x)
# [1] "0-50" "1000" "51-100"
Using ordered = TRUE in the cut() call gives the same result:
DT <- data.table(x = rnorm(100, 100, 50), fac = factor(letters[1:10]))
DT$fc_x <- cut(DT$x, breaks = c(0, 50, 100, 10e5), right = T, ordered = T,
labels = c("0-50", "51-100", "1000"))
levels(DT$fc_x)
# [1] "0-50" "51-100" "1000"
AGG <- DT[, grp(fc_x), by=fac]
levels(AGG$x)
# [1] "0-50" "1000" "51-100"
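Answer 0 (score: 3)
I think the problem is that when you define x inside the function you do not supply labels, so factor() simply puts the levels in alphabetical order. I think you just need to add the labels inside your function.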
DT$fc_x <- cut(DT$x, breaks = c(0, 50, 100, 10e5), right = T,
               labels = c("0-50", "51-100", "+100"))
factor(levels(DT$fc_x))
[1] 0-50 51-100 +100
Levels: 0-50 +100 51-100
factor(levels(DT$fc_x), labels = c("0-50", "51-100", "+100"))
[1] 0-50 +100 51-100
Levels: 0-50 51-100 +100
grp <- function(x) {
  percentage = as.numeric(table(x)/length(x))
  list(
    x = factor(levels(x), labels = levels(x)),
    percentage = percentage,
    label = paste0(round(as.numeric(table(x)/length(x)) * 100, 0), "%")
  )
}
DT <- data.table(x = rnorm(100, 100, 50), fac = factor(letters[1:10]))
DT$fc_x <- cut(DT$x, breaks = c(0, 50, 100, 10e5), right = T,
               labels = c("0-50", "51-100", "+100"))
AGG <- DT[, grp(fc_x), by=fac]
levels(AGG$x)
[1] "0-50" "51-100" "+100"
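As a side note on the factor() calls above: labels and levels do different things, and the difference is easy to miss. The sketch below is only an illustration. The level names "low", "mid" and "high" are made up (they are not from the question) and were picked so that alphabetical order differs from the intended order in any locale.

```r
# Hypothetical level names; alphabetical order is "high" < "low" < "mid"
lv <- c("low", "mid", "high")

# Without 'levels', factor() sorts the unique values
f_default <- factor(lv)
levels(f_default)        # "high" "low" "mid"

# 'labels' renames the sorted levels position by position,
# so the level order looks right but the values are relabelled
f_labels <- factor(lv, labels = lv)
levels(f_labels)         # "low" "mid" "high"
as.character(f_labels)   # "mid" "high" "low"

# 'levels' sets the order and leaves the values untouched
f_levels <- factor(lv, levels = lv)
levels(f_levels)         # "low" "mid" "high"
as.character(f_levels)   # "low" "mid" "high"
```

So a result whose levels are in the expected order can still carry values that no longer correspond to the original categories.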
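Answer 1 (score: 0)
After using the modified version of the grp function on my real data set, the levels were fine, but after aggregation they did not match the actual values.
I came up with this, which I believe is a simpler solution: keep the names that come with the table result. If I do not use as.numeric(table(...)), the names are preserved.
Thanks for your help, Matt and Matthew. I will accept your answer because it was helpful.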
grp <- function(x) {
  # data.frame(table(...)) keeps the category names in column 1 and the shares in column 2
  percentage = data.frame(table(x)/length(x))
  list(x = factor(percentage[[1]]),
       percentage = percentage[[2]],
       label = paste0(round(percentage[[2]] * 100), "%")
  )
}
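Another option that should keep both the level order and the match between levels and values is to pass levels = levels(x) to factor(). This is a minimal sketch under that assumption, not a tested replacement for the function above. The name grp_levels and the small factor f are hypothetical and are used here only for the demonstration.

```r
# Hypothetical variant of grp(): set the level order explicitly instead of relabelling
grp_levels <- function(x) {
  share <- as.numeric(table(x)) / length(x)   # table() follows the order of levels(x)
  list(
    x = factor(levels(x), levels = levels(x)),
    percentage = share,
    label = paste0(round(share * 100), "%")
  )
}

# Made-up demo data with the same labels as in the question
f <- factor(c("0-50", "+100", "+100", "51-100"),
            levels = c("0-50", "51-100", "+100"))
res <- grp_levels(f)
levels(res$x)        # "0-50" "51-100" "+100"
res$percentage       # 0.25 0.25 0.50
res$label            # "25%" "25%" "50%"
```

With data.table it would be called the same way as in the question, DT[, grp_levels(fc_x), by = fac].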