R:将多个虚拟变量重新编码为单个变量,并用变量名替换相应的虚拟值

时间:2015-05-21 01:51:31

标签: r

我有一个数据集,其中包含14个互斥的调用类型,所有这些类型都被编码为虚拟变量。这是一个小样本:

dput(df)
structure(list(MON1_12 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L), WEEK1_53 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), AGENT_ID = structure(c(3L, 
4L, 7L, 8L, 1L, 6L, 5L, 9L, 2L, 10L), .Label = c("A129", "A360", 
"A407", "B891", "D197", "L145", "L722", "O518", "T443", "W764"
), class = "factor"), CallsHandled = c(1L, 4L, 2L, 14L, 1L, 2L, 
5L, 1L, 1L, 3L), CONTENT = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L), CLAIMS = c(1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
CREDIT_CARD = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
DEDUCT_BILL = c(0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L), 
HCREFORM = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("MON1_12", 
"WEEK1_53", "AGENT_ID", "CallsHandled", "CONTENT", "CLAIMS", 
"CREDIT_CARD", "DEDUCT_BILL", "HCREFORM"), class = "data.frame", row.names = c(NA, 
-10L))

我想将每个虚拟变量组合成一个名为" QUEUE"的新变量。取代" 1"与虚拟变量的名称相对应的虚拟变量。以下是一个示例:

dput(df2)
structure(list(MON1_12 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L), WEEK1_53 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), AGENT_ID = structure(c(3L, 
4L, 7L, 8L, 1L, 6L, 5L, 9L, 2L, 10L), .Label = c("A129", "A360", 
"A407", "B891", "D197", "L145", "L722", "O518", "T443", "W764"
), class = "factor"), CallsHandled = c(1L, 4L, 2L, 14L, 1L, 2L, 
5L, 1L, 1L, 3L), QUEUE = structure(c(1L, 4L, 2L, 4L, 1L, 3L, 
3L, 5L, 5L, 4L), .Label = c("CLAIMS", "CONTENT", "CREDIT_CARD", 
"DEDUCT_BILL", "HCREFORM"), class = "factor")), .Names = c("MON1_12", 
"WEEK1_53", "AGENT_ID", "CallsHandled", "QUEUE"), class = "data.frame", row.names = c(NA, 
-10L))

编辑以回应标记下来的问题:这是我今天下午在推荐时尝试过的示例数据框略有不同:

df$Queue <- as.factor(df$CONTENT + df$CLAIMS*2 + df$CREDIT_CARD*3 +  df$DEDUCT_BILL*4 + df$HCREFORM*5)
levels(df$Queue) <- c("CONTENT", "CLAIMS", "CREDIT_CARD","DEDUCT_BILL","HCREFORM")
View(df)

但我在队列中收到了NA列。所以,我在这里重新创建了另一个样本数据集。这个数据框足以代表我实际收到的内容,除了我有大约40个变量和200万行。当我按照上面尝试的方式运行&#34; df&#34;上面我得到以下不正确的结果:

dput(df)
structure(list(MON1_12 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L), WEEK1_53 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), AGENT_ID = structure(c(3L, 
4L, 7L, 8L, 1L, 6L, 5L, 9L, 2L, 10L), .Label = c("A129", "A360", 
"A407", "B891", "D197", "L145", "L722", "O518", "T443", "W764"
), class = "factor"), CallsHandled = c(1L, 4L, 2L, 14L, 1L, 2L, 
5L, 1L, 1L, 3L), CONTENT = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L), CLAIMS = c(1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
CREDIT_CARD = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
DEDUCT_BILL = c(0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L), 
HCREFORM = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), Queue = structure(c(2L, 
1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("CONTENT", 
"CLAIMS", "CREDIT_CARD", "DEDUCT_BILL", "HCREFORM"), class = "factor")), .Names = c("MON1_12", 
"WEEK1_53", "AGENT_ID", "CallsHandled", "CONTENT", "CLAIMS", 
"CREDIT_CARD", "DEDUCT_BILL", "HCREFORM", "Queue"), row.names = c(NA, 
-10L), class = "data.frame")

我也尝试过:

df3 <- cbind(df[1:4], QUEUE = apply(df[5:9], 1, function(N) names(N)[as.logical(N)]))

但收到以下错误:&#34; data.frame错误(&#34; CLAIMS&#34;,字符(0),字符(0),&#34; DEDUCT_BILL&#34;,:   参数意味着不同的行数:1,0:

2 个答案:

答案 0 :(得分:1)

这应该产生预期的结果:

df2 <- cbind(df[1:4], QUEUE = apply(df[5:9], 1, function(N) names(N)[as.logical(N)]))

前提是任何一行中只有一个虚拟变量为1(在df的原始样本中不是这样)。

说明: df [1:4]选择要保留在输出中的第一列到第四列。然后使用QUEUE函数将列绑定到cbindQUEUE是通过迭代虚拟变量(第5列到第9列),逐行遍历数据集df并选择包含值1的列名来获得的。

答案 1 :(得分:1)

您可以使用max.col来获取第5列到第9列每行中值为“1”的列索引。('df'示例不正确,因为大多数行都是0纠正的是下面的)。

df$QUEUE <-  names(df)[-c(1:4)][max.col(df[-c(1:4)])]

或者你可以做到

df$QUEUE <-  names(df)[-(1:4)][(as.matrix(df[-(1:4)]) %*% 
                         seq_along(df[-(1:4)]))[,1]]

更新

根据编辑数据集'df',列5:9的某些行都是0,而在预期的结果中,它显示'QUEUE'为'CONTENT'。在这种情况下,我们可以先修改'CONTENT'列以更改行全部为0的值,然后应用上面的代码之一

 df$CONTENT[!rowSums(df[5:9])] <- 1
 df$QUEUE1 <-  names(df)[5:9][max.col(df[5:9])]
 df$QUEUE1
 #[1] "CLAIMS"      "CONTENT"     "CONTENT"     "DEDUCT_BILL" "CONTENT"    
 #[6] "CONTENT"     "CONTENT"     "CONTENT"     "CONTENT"     "CONTENT" 

数据

df <- structure(list(MON1_12 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L), WEEK1_53 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), 
AGENT_ID = structure(c(3L, 
4L, 7L, 8L, 1L, 6L, 5L, 9L, 2L, 10L), .Label = c("A129", "A360", 
"A407", "B891", "D197", "L145", "L722", "O518", "T443", "W764"
), class = "factor"), CallsHandled = c(1L, 4L, 2L, 14L, 1L, 2L, 
5L, 1L, 1L, 3L), CONTENT = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0), CLAIMS = c(1, 
0, 0, 0, 1, 0, 0, 0, 0, 0), CREDIT_CARD = c(0, 0, 0, 0, 0, 1, 
1, 0, 0, 0), DEDUCT_BILL = c(0, 1, 0, 1, 0, 0, 0, 0, 0, 1),
 HCREFORM = c(0, 
0, 0, 0, 0, 0, 0, 1, 1, 0)), .Names = c("MON1_12", "WEEK1_53", 
"AGENT_ID", "CallsHandled", "CONTENT", "CLAIMS", "CREDIT_CARD", 
"DEDUCT_BILL", "HCREFORM"), row.names = c(NA, -10L), class = "data.frame")