Is there a way to check whether any value in a row matches any value in the row above it, regardless of order? Below is a fairly random input data table.
DT <- data.table(A=c("a","a","b","d","e","f","h","i","j"),
B=c("a","b","c","c","f","g",NA,"j",NA),
C=c("a","b","c","b","g","h",NA,NA,NA))
> DT
A B C
1: a a a
2: a b b
3: b c c
4: d c b
5: e f g
6: f g h
7: h NA NA
8: i j NA
9: j NA NA
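I would like to add a column D that compares each row with the row above it and reports whether the two rows share any value (regardless of order). So the desired output is: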
> DT
A B C D
1: a a a 0 #No row above to compare; could be either NA or 0
2: a b b 1 #row 2 has "a", which is in row 1; returns 1
3: b c c 1 #row 3 has "b", which is in row 2; returns 1
4: d c b 1 #row 4 has "b" and "c", which are in row 3; returns 1
5: e f g 0 #row 5 has nothing that is in row 4; returns 0
6: f g h 1 #row 6 has "f" and "g", which are in row 5; returns 1
7: h NA NA 1 #row 7 has "h", which is in row 6; returns 1
8: i j NA 0 #row 8 has nothing that is in row 7 (NA doesn't count)
9: j NA NA 1 #row 9 has "j", which is in row 8; returns 1 (NA doesn't count)
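The main idea is that I want to compare a row (or vector) with another row (vector) and treat the two as matching if any element of one also appears in the other, without iterating over the elements and comparing them one by one.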
Answer 0: (score: 4)
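We can do this by taking the lead rows of the dataset, pasting each row together, checking for any pattern against the pasted rows of the original dataset using grepl and Map, and then applying unlist and converting the result to integer:
DT[, D := {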
v1 <- do.call(paste, .SD)
v2 <- do.call(paste, c(shift(.SD, type = "lead"), sep="|"))
v2N <- gsub("NA\\|*|\\|*NA", "", v2)
v3 <- unlist(Map(grepl, v2N, v1), use.names = FALSE)
as.integer(head(c(FALSE, v3), -1))
}]
DT
# A B C D
#1: a a a 0
#2: a b b 1
#3: b c c 1
#4: d c b 1
#5: e f g 0
#6: f g h 1
#7: h NA NA 1
#8: i j NA 0
#9: j NA NA 1
Alternatively, we can split the data by row and compare consecutive rows using Map.
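The code for the split and Map variant is not shown above. The following is only a minimal sketch of how such an approach might look, not the answerer's original code. The names rows and shared are illustrative, and the table is converted with as.data.frame so that base R's split method for data frames is used.

```r
library(data.table)

DT <- data.table(A = c("a","a","b","d","e","f","h","i","j"),
                 B = c("a","b","c","c","f","g",NA,"j",NA),
                 C = c("a","b","c","b","g","h",NA,NA,NA))

# One character vector per row, with NAs dropped (assumed helper list)
rows <- lapply(split(as.data.frame(DT), seq_len(nrow(DT))), function(r) {
  v <- unlist(r, use.names = FALSE)
  v[!is.na(v)]
})

# Pair every row from the second onwards with the row above it
shared <- unlist(Map(function(cur, prev) length(intersect(cur, prev)) > 0,
                     rows[-1], rows[-length(rows)]),
                 use.names = FALSE)

# The first row has nothing above it, so it is padded with FALSE (0)
DT[, D := as.integer(c(FALSE, shared))]
```

Dropping the NAs before the comparison keeps two rows that only have NA in common from being counted as a match.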
Answer 1: (score: 3)
Here is another method. It is probably not a good fit for large data.tables, because it uses by=1:nrow(DT), which tends to be slow.
DT[, D:= sign(DT[, c(.SD, shift(.SD))][,
sum(!is.na(intersect(unlist(.SD[, .(A, B, C)]), unlist(.SD[, .(V4, V5, V6)])))),
by=1:nrow(DT)]$V1)]
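Here, [, c(.SD, shift(.SD))] creates a copy of the data.table with the lagged variables bound on as extra columns (V4, V5, V6). The second chain then intersects the unlisted variables of the original columns with those of the shifted columns. Each NA in the intersection counts as 0 and each non-NA counts as 1, and these counts are summed. This is done for every row of the copied data.table. The sums are extracted with $V1 and converted to binary (0 and 1) using sign.
This returns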
DT
A B C D
1: a a a 0
2: a b b 1
3: b c c 1
4: d c b 1
5: e f g 0
6: f g h 1
7: h NA NA 1
8: i j NA 0
9: j NA NA 1
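To make the role of !is.na and sign concrete, here is a small worked example in base R for two pairs of rows from the table. The variable names are illustrative only; the values are copied from rows 3 and 4 and from rows 7 and 8.

```r
# Rows 3 and 4 share "b" and "c"
prev <- c("b", "c", "c")                    # row 3
cur  <- c("d", "c", "b")                    # row 4
n_common <- sum(!is.na(intersect(cur, prev)))
d_row4 <- sign(n_common)                    # any positive count collapses to 1

# Rows 7 and 8 have only NA in common; intersect() returns that NA,
# which is why the NAs are filtered out before summing
prev_na <- c("h", NA, NA)                   # row 7
cur_na  <- c("i", "j", NA)                  # row 8
common_na <- intersect(cur_na, prev_na)
d_row8 <- sign(sum(!is.na(common_na)))
```

Without the !is.na filter, the shared NA would be counted and row 8 would wrongly be flagged as matching row 7.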
Answer 2: (score: 2)
I would use sapply over the row indices of the table (excluding the last one):
compare <- function(i) {
row1 <- as.character(DT[i,])
row2 <- as.character(DT[i+1,])
return(length(intersect(row1[!is.na(row1)], row2[!is.na(row2)])) > 0)
}
result <- sapply(1:(nrow(DT) - 1), compare)
This returns a logical vector, so if you would rather have integers, wrap the output of compare in as.numeric().
Answer 3: (score: 2)
Here is a base R solution that uses intersect.
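The code for this answer is not shown above. What follows is only a minimal sketch of what a base R approach built on intersect could look like, under the assumption that all columns are character. The data frame df and the helper shares_value are hypothetical names introduced for illustration.

```r
df <- data.frame(A = c("a","a","b","d","e","f","h","i","j"),
                 B = c("a","b","c","c","f","g",NA,"j",NA),
                 C = c("a","b","c","b","g","h",NA,NA,NA),
                 stringsAsFactors = FALSE)

# TRUE when the two vectors have at least one non-NA value in common
shares_value <- function(cur, prev) {
  length(intersect(cur[!is.na(cur)], prev[!is.na(prev)])) > 0
}

m <- as.matrix(df)  # character matrix, so m[i, ] is one row as a vector

# Row 1 has no row above it and gets 0; every later row is compared
# with its predecessor
df$D <- c(0L, vapply(seq_len(nrow(m))[-1],
                     function(i) as.integer(shares_value(m[i, ], m[i - 1, ])),
                     integer(1)))
```

Converting to a matrix once avoids repeated row subsetting of the data frame inside the loop.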
Answer 4: (score: 2)
Here is a loop-free approach using data.table joins:
DT[, id := 1:.N]
dt <- melt(DT, id.vars = "id")
dt[, id2 := id-1]
dt <- dt[!is.na(value)]
idx <- dt[dt, on = .(id2 = id, value), nomatch=0][, unique(id)]
DT[, `:=`(D = as.integer(id %in% idx), id = NULL)]
It looks somewhat complicated, but it does perform quite well on a 1-million-row dataset with three columns.
Answer 5: (score: 1)
This solution compares the two rows with %in% (after unlist()):
DT[, result:=as.integer(c(NA, sapply(2:DT[,.N], function(i) any(na.omit(unlist(DT[i])) %in% unlist(DT[i-1])))))]
#> DT
# A B C result
#1: a a a NA
#2: a b b 1
#3: b c c 1
#4: d c b 1
#5: e f g 0
#6: f g h 1
#7: h NA NA 1
#8: i j NA 0
#9: j NA NA 1
Answer 6: (score: 1)
Using a combination of intersect and mapply, you could do:
#list of unique elements in each row
tableList = apply(DT,1,function(x) unique(na.omit(x)))
#a lagged list to be compared with above list
tableListLag = c(NA,tableList[2:length(tableList)-1])
#find common elements using intersect function
#if length > 0 implies common elements hence set value as 1 else 0
DT$D = mapply(function(x,y) ifelse(length(intersect(x,y))>0,1,0) ,tableList,tableListLag,
SIMPLIFY = TRUE)
DT
# A B C D
#1: a a a 0
#2: a b b 1
#3: b c c 1
#4: d c b 1
#5: e f g 0
#6: f g h 1
#7: h NA NA 1
#8: i j NA 0
#9: j NA NA 1