Compare columns and put the output in an additional column

Time: 2015-06-11 15:09:42

Tags: r

Let's start with some example data:

structure(list(P1 = structure(c(1L, 1L, 3L, 3L, 5L, 5L, 5L, 5L, 
4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 2L, 2L), .Label = c("Apple", 
"Grape", "Orange", "Peach", "Tomato"), class = "factor"), P2 = structure(c(4L, 
4L, 3L, 3L, 5L, 5L, 5L, 5L, 6L, 6L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 
1L, 6L, 6L), .Label = c("Banana", "Cucumber", "Lemon", "Orange", 
"Potato", "Tomato"), class = "factor"), P1_location_subacon = structure(c(2L, 
2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), .Label = c("Fridge", "Table"), class = "factor"), 
    P1_location_all_predictors = structure(c(2L, 2L, 3L, 3L, 
    3L, 3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L), .Label = c("Table,Desk,Bag,Fridge,Bed,Shelf,Chair", 
    "Table,Shelf,Cupboard,Bed,Fridge", "Table,Shelf,Fridge"), class = "factor"), 
    P2_location_subacon = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 
    2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Fridge", 
    "Shelf"), class = "factor"), P2_location_all_predictors = structure(c(3L, 
    3L, 2L, 2L, 1L, 1L, 1L, 1L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L, 
    3L, 3L, 3L, 3L), .Label = c("Shelf,Fridge", "Shelf,Fridge,Bed", 
    "Table,Shelf,Fridge"), class = "factor")), .Names = c("P1", 
"P2", "P1_location_subacon", "P1_location_all_predictors", "P2_location_subacon", 
"P2_location_all_predictors"), class = "data.frame", row.names = c(NA, 
-20L))

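I want to compare two pairs of columns. The first pair is P1_location_subacon and P2_location_subacon. The second pair is P1_location_all_predictors and P2_location_all_predictors.

How do I want to compare them? Each column holds the "locations" of different fruits/vegetables. So:

  1. If the locations in the first pair (P1/P2_location_subacon) are the same, I want to put the number 2 in the additional column.

  2. If the locations in the second pair (P1/P2_location_all_predictors) are the same, I want to put the number 1 in the additional column. This one is a bit more complicated, because not all of the locations have to be the same. At least one location must be shared between the two fruits/vegetables.

  3. If they differ in both cases, put 0. You won't see this case in the example data.

  4. To sum up, here is the output I want to achieve: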

    structure(list(P1 = structure(c(1L, 1L, 3L, 3L, 5L, 5L, 5L, 5L, 
    4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 2L, 2L), .Label = c("Apple", 
    "Grape", "Orange", "Peach", "Tomato"), class = "factor"), P2 = structure(c(4L, 
    4L, 3L, 3L, 5L, 5L, 5L, 5L, 6L, 6L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 
    1L, 6L, 6L), .Label = c("Banana", "Cucumber", "Lemon", "Orange", 
    "Potato", "Tomato"), class = "factor"), P1_location_subacon = structure(c(2L, 
    2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L), .Label = c("Fridge", "Table"), class = "factor"), 
        P1_location_all_predictors = structure(c(2L, 2L, 3L, 3L, 
        3L, 3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
        3L), .Label = c("Table,Desk,Bag,Fridge,Bed,Shelf,Chair", 
        "Table,Shelf,Cupboard,Bed,Fridge", "Table,Shelf,Fridge"), class = "factor"), 
        P2_location_subacon = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 
        2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Fridge", 
        "Shelf"), class = "factor"), P2_location_all_predictors = structure(c(3L, 
        3L, 2L, 2L, 1L, 1L, 1L, 1L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L, 
        3L, 3L, 3L, 3L), .Label = c("Shelf,Fridge", "Shelf,Fridge,Bed", 
        "Table,Shelf,Fridge"), class = "factor"), X = c(NA, NA, NA, 
        NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
        NA, NA), Correct = c(1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 
        1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L)), .Names = c("P1", 
    "P2", "P1_location_subacon", "P1_location_all_predictors", "P2_location_subacon", 
    "P2_location_all_predictors", "X", "Correct"), class = "data.frame", row.names = c(NA, 
    -20L))
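
The three rules above can be sketched for a single row as follows. This is only an illustration: `score_locations` is a hypothetical helper name that does not appear in the question, the inputs are assumed to be character strings rather than factors, and letting rule 1 take precedence over rule 2 is inferred from the desired `Correct` column (row 3 satisfies both rules and is scored 2).

```r
# Hypothetical helper (not from the question): score one row.
# Returns 2 if the subacon locations are identical, 1 if the comma-separated
# all_predictors lists share at least one location, and 0 otherwise.
score_locations <- function(p1_sub, p2_sub, p1_all, p2_all) {
  if (p1_sub == p2_sub) {
    return(2L)
  }
  shared <- intersect(strsplit(p1_all, ",", fixed = TRUE)[[1]],
                      strsplit(p2_all, ",", fixed = TRUE)[[1]])
  if (length(shared) > 0) 1L else 0L
}

# Row 1 of the example data: Table vs Fridge, but the lists overlap
r1 <- score_locations("Table", "Fridge",
                      "Table,Shelf,Cupboard,Bed,Fridge", "Table,Shelf,Fridge")
# Row 3 of the example data: Fridge vs Fridge
r3 <- score_locations("Fridge", "Fridge",
                      "Table,Shelf,Fridge", "Shelf,Fridge,Bed")
# A made-up row with nothing in common
r0 <- score_locations("Table", "Fridge", "Desk,Bag", "Shelf,Bed")
c(r1, r3, r0)
```

With these inputs the three calls give 1, 2 and 0, matching rows 1 and 3 of the desired output plus the "both differ" case that the example data does not contain.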
    

2 Answers:

Answer 0: (score: 4)

Edit: I improved this answer using the feedback from "Test two columns of strings for match row-wise in R".

DT is your table:

library(data.table)
setDT(DT)
# Convert every column from factor to character
DT <- data.table(sapply(DT,as.character))

# Turn the comma-separated P1 lists into regex alternations ("A|B|C")
DT[, P1_location_all_predictors := gsub(",","|",P1_location_all_predictors)]
DT[, P1_location_subacon := gsub(",","|",P1_location_subacon)]

# Within each group the P1 pattern is a single string, so grepl() tests it
# against every P2 value in that group
DT[, match_all_pred := grepl(P1_location_all_predictors, P2_location_all_predictors) + 0, by = P1_location_all_predictors]
DT[, match_subacon := grepl(P1_location_subacon, P2_location_subacon), by = P1_location_subacon]


# Restore the original comma-separated form
DT[, P1_location_all_predictors := gsub("\\|",",",P1_location_all_predictors)]
DT[, P1_location_subacon := gsub("\\|",",",P1_location_subacon)]
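
The alternation trick used in this code can be seen on single strings. The snippet below is a minimal sketch in base R; the strings `"Desk,Bag"` and `"Bedroom,Desk"` are made up for illustration and are not in the example data.

```r
# Turn a comma-separated list into a regular-expression alternation
p1 <- "Table,Shelf,Fridge"
pattern <- gsub(",", "|", p1, fixed = TRUE)   # "Table|Shelf|Fridge"

# TRUE when any of the alternatives occurs anywhere in the second string
hit  <- grepl(pattern, "Shelf,Fridge,Bed")
miss <- grepl(pattern, "Desk,Bag")

# Caveat: grepl() matches substrings, not whole list items, so a location
# name contained in a longer name would also count as a match
partial <- grepl("Bed", "Bedroom,Desk")

c(hit = hit, miss = miss, partial = partial)
```

Here `hit` and `partial` are TRUE and `miss` is FALSE. The location names in the example data do not contain one another, so the substring caveat does not affect the result there.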

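I went with two columns instead of the 0/1/2 notation, which makes the code less simple because you have to rely on nested ifs. I also think multiple columns are better, because you can clearly see the F/F, T/F, F/T and T/T cases.

If you do have to create the 0/1/2 column, you can call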

DT[, MyCol := match_all_pred - match_subacon*match_all_pred+match_subacon*2]

This assumes that a subacon match takes precedence over an all-locations match.
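
To check how that expression combines the two flags, it can be evaluated over all four combinations. This is a small sketch in base R; `cases` is just an illustrative name.

```r
# Enumerate every combination of the two 0/1 flags
cases <- expand.grid(match_all_pred = 0:1, match_subacon = 0:1)

# Same arithmetic as the MyCol expression
cases$MyCol <- with(cases,
  match_all_pred - match_subacon * match_all_pred + match_subacon * 2)

cases
```

The result is 0 when neither flag is set, 1 when only `match_all_pred` is set, and 2 whenever `match_subacon` is set, whether or not `match_all_pred` is also set.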

Answer 1: (score: 2)

Here is another way:

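# Convert every column from factor to character so strsplit() can be used
myData <- data.frame(sapply(myData, as.character), stringsAsFactors=FALSE)

# TRUE if the two sets have at least one element in common
doesIntersect <- function(setA, setB) {length(intersect(setA,setB)) > 0}

# Start at 0, set 1 where the all_predictors lists overlap, then overwrite
# with 2 where the subacon locations are equal
myData$Correct <- 0
myData$Correct[mapply(doesIntersect, strsplit(myData$P1_location_all_predictors, ","), strsplit(myData$P2_location_all_predictors, ","))] <- 1
myData$Correct[mapply(setequal, strsplit(myData$P1_location_subacon, ","), strsplit(myData$P2_location_subacon, ","))] <- 2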

> myData$Correct
[1] 1 1 2 2 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
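
Two details of this approach can be shown on a toy input: why the factor columns are converted to character first, and how `mapply` pairs up the split lists row by row. This is a minimal sketch; the vectors `p1` and `p2` and the name `has_common` are made up for illustration.

```r
# Two made-up rows of comma-separated locations, stored as factors
p1 <- factor(c("Table,Shelf", "Desk,Bag"))
p2 <- factor(c("Shelf,Fridge", "Bed"))

# strsplit() only accepts character vectors, so a factor raises an error
factor_fails <- inherits(try(strsplit(p1, ","), silent = TRUE), "try-error")

# After conversion, each element becomes a character vector of locations
s1 <- strsplit(as.character(p1), ",")
s2 <- strsplit(as.character(p2), ",")

# mapply() calls the function on s1[[1]] and s2[[1]], then s1[[2]] and s2[[2]]
has_common <- function(a, b) length(intersect(a, b)) > 0
overlap <- mapply(has_common, s1, s2)
overlap
```

The first row shares "Shelf" and the second shares nothing, so `overlap` is TRUE, FALSE. Unlike a regex match, `intersect` compares whole list items, so one location name contained in another does not count as a match.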