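I have a dataset containing some logical columns, and I want to replace each TRUE value with the name of the column it sits in. I asked a similar question here and, with suggestions from other S/O users, worked out a suitable solution. However, that solution does not use data.table syntax and copies the whole dataset instead of replacing by reference, which is very time-consuming.
What is the most appropriate way to do this with data.table syntax?
I have tried: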
# Load library
library(data.table)
# Create dummy data.table:
mydt <- data.table(id = c(1, 2, 3, 4, 5),
                   ptname = c("jack", "jill", "jo", "frankie", "claire"),
                   sex = c("m", "f", "f", "m", "f"),
                   apple = c(TRUE, FALSE, FALSE, TRUE, TRUE),
                   orange = c(FALSE, TRUE, FALSE, TRUE, FALSE),
                   pear = c(TRUE, TRUE, TRUE, TRUE, FALSE))
# View dummy data:
> mydt
id ptname sex apple orange pear
1: 1 jack m TRUE FALSE TRUE
2: 2 jill f FALSE TRUE TRUE
3: 3 jo f FALSE FALSE TRUE
4: 4 frankie m TRUE TRUE TRUE
5: 5 claire f TRUE FALSE FALSE
# Function to recode values in a data.table:
recode.multi <- function(datacol, oldval, newval) {
  trans <- setNames(newval, oldval)
  trans[match(datacol, names(trans))]
}
# Get a list of all the logical columns in the data set:
logicalcols <- names(which(mydt[, sapply(mydt, is.logical)] == TRUE))
# Apply the function to convert 'TRUE' to the relevant column names:
mydt[, (logicalcols) := lapply(.SD, recode.multi,
oldval = c(FALSE, TRUE),
newval = c("FALSE", names(.SD))), .SDcols = logicalcols]
# View the result:
> mydt
id ptname sex apple orange pear
1: 1 jack m apple FALSE apple
2: 2 jill f FALSE apple apple
3: 3 jo f FALSE FALSE apple
4: 4 frankie m apple apple apple
5: 5 claire f apple FALSE FALSE
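This is incorrect: instead of stepping through the column names and using each one as the replacement value for its own column, it recycles only the first name (in this case "apple").
Furthermore, if I reverse the order of the old and new values, the function ignores the string replacement for the second value and uses the first two column names as the replacements in every column: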
# Apply the function with order of old and new values reversed:
mydt[, (logicalcols) := lapply(.SD, recode.multi,
oldval = c(TRUE, FALSE),
newval = c(names(.SD), "FALSE")), .SDcols = logicalcols]
# View the result:
> mydt
id ptname sex apple orange pear
1: 1 jack m apple orange apple
2: 2 jill f orange apple apple
3: 3 jo f orange orange apple
4: 4 frankie m apple apple apple
5: 5 claire f apple orange orange
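I am sure I am missing something simple, but does anyone know why the function does not iterate over the column names (and how to edit it so that it does)?
My expected output is as follows: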
> mydt
id ptname sex apple orange pear
1: 1 jack m apple FALSE pear
2: 2 jill f FALSE orange pear
3: 3 jo f FALSE FALSE pear
4: 4 frankie m apple orange pear
5: 5 claire f apple FALSE FALSE
Any other suggestions for concise data.table syntax would also be much appreciated.
Answer 0 (score: 3)
We can use a melt/dcast approach:
dcast(melt(mydt, id.var = c("id", "ptname", "sex"))[,
value1 := as.character(value)][(value), value1 := variable],
id + ptname + sex~variable, value.var = "value1")
# id ptname sex apple orange pear
#1: 1 jack m apple FALSE pear
#2: 2 jill f FALSE orange pear
#3: 3 jo f FALSE FALSE pear
#4: 4 frankie m apple orange pear
#5: 5 claire f apple FALSE FALSE
Another option is to use set, which is more efficient:
nm1 <- which(unlist(mydt[, lapply(.SD, class)])=="logical")
for (j in nm1) {
  i1 <- which(mydt[[j]])
  set(mydt, i = NULL, j = j, value = as.character(mydt[[j]]))
  set(mydt, i = i1, j = j, value = names(mydt)[j])
}
mydt
# id ptname sex apple orange pear
#1: 1 jack m apple FALSE pear
#2: 2 jill f FALSE orange pear
#3: 3 jo f FALSE FALSE pear
#4: 4 frankie m apple orange pear
#5: 5 claire f apple FALSE FALSE
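The set loop above can also be wrapped in a small function. The following is only a sketch: the helper name label_true_cols and the variable alias are made up for illustration and are not part of data.table. It runs on a cut-down version of the dummy table and uses a second name for the same table to show that the change happens in place rather than on a copy.

```r
library(data.table)

# Hypothetical helper wrapping the set() loop; the name is illustrative only
label_true_cols <- function(dt) {
  lgl_cols <- which(vapply(dt, is.logical, logical(1)))
  for (j in lgl_cols) {
    true_rows <- which(dt[[j]])                    # rows that are TRUE (NA is skipped)
    set(dt, j = j, value = as.character(dt[[j]]))  # whole column becomes character
    set(dt, i = true_rows, j = j, value = names(dt)[j])
  }
  invisible(dt)
}

# Cut-down version of the dummy data from the question
mydt <- data.table(id = 1:5,
                   apple  = c(TRUE, FALSE, FALSE, TRUE, TRUE),
                   orange = c(FALSE, TRUE, FALSE, TRUE, FALSE))

alias <- mydt          # a second name for the same table, not a copy
label_true_cols(mydt)
alias$apple            # the recoded values are visible through the alias too
```

Because which() drops NA, missing values in a logical column stay NA after the conversion instead of being labelled with the column name.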
Another option, mentioned in the comments, is:
mydt[, (nm1) := Map(function(x,y) replace(x, x, y), .SD, names(mydt)[nm1]), .SDcols = nm1]
mydt
# id ptname sex apple orange pear
#1: 1 jack m apple FALSE pear
#2: 2 jill f FALSE orange pear
#3: 3 jo f FALSE FALSE pear
#4: 4 frankie m apple orange pear
#5: 5 claire f apple FALSE FALSE
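As for why the attempt in the question recycles "apple": inside lapply(.SD, ...), names(.SD) is evaluated once and yields all three column names, so newval has four elements while oldval has only two. setNames() then names just the first two elements, and match() can never reach the rest. A minimal base-R sketch of this, using recode.multi from the question (the vectors x and all_names are made up for illustration):

```r
# recode.multi as defined in the question
recode.multi <- function(datacol, oldval, newval) {
  trans <- setNames(newval, oldval)
  trans[match(datacol, names(trans))]
}

x <- c(TRUE, FALSE, TRUE)                   # illustrative stand-in for one logical column
all_names <- c("apple", "orange", "pear")   # what names(.SD) evaluates to in the question

# Four new values but only two old values: only the first two elements get names
trans <- setNames(c("FALSE", all_names), c(FALSE, TRUE))
names(trans)
# "FALSE" "TRUE" NA NA

# Every TRUE therefore maps to "apple", whichever column is being recoded
bad <- unname(recode.multi(x, c(FALSE, TRUE), c("FALSE", all_names)))

# Passing exactly one column name per call gives the intended mapping
good <- unname(recode.multi(x, c(FALSE, TRUE), c("FALSE", "pear")))
```

This is why the Map version works: it pairs each column with its own name, so every call sees a single replacement value.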
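Update: Comparing options two and three (option one was not feasible because of the number of non-logical columns) on a dataset of 18,573 rows and 650 columns, of which 252 are logical, gave the following timings: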
# Option 2:
nm1 <- which(unlist(mydt[, lapply(.SD, is.logical)]))
system.time(
for(j in nm1){
i1 <- which(mydt[[j]])
set(mydt, i=NULL, j=j, value = as.character(mydt[[j]]))
set(mydt, i = i1, j=j, value = names(mydt)[j])
}
)
# user system elapsed
# 0.61 0.00 0.61
# Option 3:
system.time(
mydt[, (nm1) := Map(function(x,y) replace(x, x, y), .SD, names(mydt)[nm1]), .SDcols = nm1]
)
#user system elapsed
#0.65 0.00 0.66
Both are noticeably faster than the original approach, which does not use data.table syntax:
# Original approach:
logitrue <- which(mydt == TRUE, arr.ind = T)
system.time(
mydt[logitrue, ] <- colnames(mydt)[logitrue[,2]]
)
# user system elapsed
# 1.22 0.03 4.22
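The dataset behind these timings is not available, so the comparison can only be approximated. The sketch below builds a synthetic table and runs options two and three on separate copies. The sizes, the seed, and the names flag1, flag2, ... are assumptions chosen for illustration, and absolute timings will differ by machine and data.

```r
library(data.table)

set.seed(1)        # arbitrary seed, for reproducibility only
n_rows <- 1000L    # assumed sizes, much smaller than the real dataset
n_lgl  <- 20L

# Hypothetical test table: one id column plus n_lgl random logical columns
flags <- replicate(n_lgl, sample(c(TRUE, FALSE), n_rows, replace = TRUE),
                   simplify = FALSE)
names(flags) <- paste0("flag", seq_len(n_lgl))
dt <- as.data.table(c(list(id = seq_len(n_rows)), flags))

dt2 <- copy(dt)    # option two works on this copy
dt3 <- copy(dt)    # option three works on this copy
nm1 <- which(sapply(dt, is.logical))

# Option two: set() in a loop
t2 <- system.time(
  for (j in nm1) {
    i1 <- which(dt2[[j]])
    set(dt2, i = NULL, j = j, value = as.character(dt2[[j]]))
    set(dt2, i = i1, j = j, value = names(dt2)[j])
  }
)

# Option three: Map() inside :=
t3 <- system.time(
  dt3[, (nm1) := Map(function(x, y) replace(x, x, y), .SD, names(dt3)[nm1]),
      .SDcols = nm1]
)

# Both options should give the same recoded table
same_result <- isTRUE(all.equal(dt2, dt3))
```

copy() is needed here because both options modify their input by reference; without it the second option would run on columns that are already character.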