I have a list like this:
mylist <- list(PP = c("PP 1", "OMITTED"),
IN01 = c("DID NOT PARTICIPATE", "PARTICIPATED", "OMITTED"),
RD1 = c("YES", "NO", "NOT REACHED", "INVALID", "OMITTED"),
RD2 = c("YES", "NO", "NOT REACHED", "NOT AN OPTION", "OMITTED"),
LOS = c("LESS THAN 3", "3 TO 100", "100 TO 500", "MORE THAN 500", "LOGICALLY NOT APPLICABLE", "OMITTED"),
COM = c("BAN", "SBAN", "RAL"),
VR1 = c("WITHIN 30", "WITHIN 200", "NOT AVAILABLE", "OMITTED"),
INF = c("A LOT", "SOME", "LITTLE OR NO", "NOT APPLICABLE", "OMITTED"),
IST = c("FULL-TIME", "PART-TIME", "FULL STAFFED", "NOT STAFFED", "LOGICALLY NOT APPLICABLE", "OMITTED"),
CMP = c("ALL", "MOST", "SOME", "NONE", "LOGICALLY NOT APPLICABLE", "OMITTED"))
I have another list like this:
matchlist <- list("INVALID", c("INVALID", "OMITTED OR INVALID"),
c("INVALID", "OMITTED"), "OMITTED", c("NOT REACHED", "INVALID", "OMITTED"),
c("LOGICALLY NOT APPLICABLE", "INVALID", "OMITTED"),
c("LOGICALLY NOT APPLICABLE", "INVALID", "OMITTED OR INVALID"),
c("Not applicable", "Not stated"), c("Not reached", "Not administered/missing by design", "Presented but not answered/invalid"),
c("Not administered/missing by design", "Presented but not answered/invalid"),
"OMITTED OR INVALID",
c("LOGICALLY NOT APPLICABLE", "OMITTED OR INVALID"),
c("NOT REACHED", "OMITTED"),
c("NOT APPLICABLE", "OMITTED"),
c("LOGICALLY NOT APPLICABLE", "OMITTED"),
c("LOGICALLY NOT APPLICABLE", "NOT REACHED", "OMITTED"),
"NOT EXCLUDED", c("Default", "Not applicable", "Not stated"), c("Valid Skip", "Not Reached", "Not Applicable", "Invalid", "No Response"),
c("Not administered", "Omitted"),
c("NOT REACHED", "INVALID RESPONSE", "OMITTED"),
c("INVALID RESPONSE", "OMITTED"))
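As you can see, some of the vectors in matchlist partially match the vectors in mylist. In some cases, a vector in matchlist exactly matches part of a vector in mylist. For example, the last values of RD1 in mylist match the vector in the fifth component of matchlist, but RD2 does not match it, even though they share common values. The values of RD2 in mylist ("NOT REACHED", "NOT AN OPTION", "OMITTED"), taken together in this order, have no match in any of the vectors in matchlist. The same goes for the values of COM in mylist.
What I want to achieve is to compare the elements of each vector in mylist with each vector in matchlist, extract the common values that match the values in matchlist in the same order, and store them in another list. The desired result should look like this: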
$PP
[1] "OMITTED"
$IN01
[1] "OMITTED"
$RD1
[1] "NOT REACHED" "INVALID" "OMITTED"
$RD2
character(0)
$LOS
[1] "LOGICALLY NOT APPLICABLE" "OMITTED"
$COM
character(0)
$VR1
[1] "OMITTED"
$INF
[1] "NOT APPLICABLE" "OMITTED"
$IST
[1] "LOGICALLY NOT APPLICABLE" "OMITTED"
$CMP
[1] "LOGICALLY NOT APPLICABLE" "OMITTED"
What I have tried so far:
Using intersect:
lapply(mylist, function(i) {
intersect(i, lapply(matchlist, function(i) {i}))
})
It returns only the last value ("OMITTED") of each of the vectors in matchlist.
Using match via %in%:
lapply(mylist, function(i) {
i[which(i %in% matchlist)]
})
This returns the desired result only for RD1 ("INVALID", "OMITTED"); for the rest it returns only the last value ("OMITTED"), except for COM.
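A likely explanation for this behaviour, shown as a minimal sketch on toy data (tbl and x are made-up names, not the full lists from the question): when the right-hand side of %in% is a list, it is coerced to a character vector first, so a multi-element vector turns into a single deparsed string and only the length-one components can ever match.

```r
# Toy data, assumed for illustration only
tbl <- list("INVALID", c("NOT REACHED", "OMITTED"), "OMITTED")
x   <- c("NOT REACHED", "INVALID", "OMITTED")

# The list is coerced element by element; the two-element vector
# becomes one string that looks like the code c("NOT REACHED", "OMITTED")
coerced <- as.character(tbl)

in_list <- x %in% tbl          # FALSE TRUE TRUE: only length-one components match
in_flat <- x %in% unlist(tbl)  # TRUE TRUE TRUE: every individual string is visible
```

This is why "NOT REACHED" is dropped for RD1 in the attempt above even though it appears inside several matchlist vectors.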
Using mapply and intersect:
mapply(intersect, mylist, matchlist)
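This returns a long list with almost everything in it, including combinations that should not be there, plus a warning about unequal lengths.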
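The reason is that mapply walks its arguments in parallel (first with first, second with second) and recycles the shorter one, rather than comparing every vector with every other vector. A small sketch with made-up lists a and b (not the question's data) shows the difference from a nested loop:

```r
# Toy lists, assumed for illustration only
a <- list(x = c("A", "B"), y = c("C", "D"))
b <- list(c("B", "Z"), c("A", "C"))

# Parallel pairing: a[[1]] with b[[1]], a[[2]] with b[[2]]
pairwise <- mapply(intersect, a, b, SIMPLIFY = FALSE)
# pairwise$x is "B", pairwise$y is "C"

# Every vector of a against every vector of b needs two nested loops
all_pairs <- lapply(a, function(v) lapply(b, intersect, v))
# all_pairs$x[[2]] is "A", a combination mapply never looks at
```

With lists of different lengths, as in the question, the shorter list is recycled, which is where the unexpected combinations and the length warning come from.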
Can someone help?
Answer 0 (score: 4)
Here is an approach that uses unlist on matchlist:
lapply(mylist, function(x) x[x %in% unlist(matchlist)])
Output (new list):
$PP
[1] "OMITTED"
$IN01
[1] "OMITTED"
$RD1
[1] "NOT REACHED" "INVALID" "OMITTED"
$LOS
[1] "LOGICALLY NOT APPLICABLE" "OMITTED"
$COM
character(0)
$VR1
[1] "OMITTED"
$INF
[1] "NOT APPLICABLE" "OMITTED"
$IST
[1] "LOGICALLY NOT APPLICABLE" "OMITTED"
$CMP
[1] "LOGICALLY NOT APPLICABLE" "OMITTED"
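One thing to keep in mind with the flattened approach: unlist discards the grouping inside matchlist, so order and adjacency are no longer checked. A short sketch using RD2 as defined in the question and just two of the matchlist vectors (ml is a shortened stand-in, assumed for brevity):

```r
rd2 <- c("YES", "NO", "NOT REACHED", "NOT AN OPTION", "OMITTED")

# Two vectors taken from the question's matchlist, enough to show the effect
ml <- list(c("NOT REACHED", "INVALID", "OMITTED"),
           c("NOT REACHED", "OMITTED"))

# Flattening tests each string on its own
flat_hits <- rd2[rd2 %in% unlist(ml)]
# "NOT REACHED" "OMITTED", even though "NOT AN OPTION" sits between them in rd2
```

The question asks for character(0) for RD2, so this approach fits only if matching individual values is acceptable.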
Answer 1 (score: 3)
Simply writing
lapply(mylist, intersect, unlist(matchlist))
also works.
Answer 2 (score: 2)
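lapply(mylist, function(i) {
  # "\\b" word boundaries keep "NO" from matching inside "NOT";
  # grepl coerces matchlist to one string per component before searching
  unlist(sapply(i, function(x) { if (any(grepl(paste0("\\b", x, "\\b"), matchlist))) { x } }))
})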
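I added "\b" before and after the string, because otherwise "NO" would cause "NOT" to be found. Using grepl is certainly not the best way, as the other answers show :)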
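The effect of the word boundaries can be checked in isolation. This is only a sketch on single strings, not on the full lists:

```r
plain    <- grepl("NO", "NOT REACHED")        # TRUE: "NO" is found inside "NOT"
bounded  <- grepl("\\bNO\\b", "NOT REACHED")  # FALSE: no word boundary after "NO"
own_word <- grepl("\\bNO\\b", "YES, NO")      # TRUE: "NO" stands as its own word
```

Note that a word boundary still allows a shorter label to match inside a longer one made of whole words, for example "NOT APPLICABLE" inside "LOGICALLY NOT APPLICABLE".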
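Answer 3 (score: 1)
There are some very simple and good answers here, but they all seem to rely on unlist. I assumed that you need to preserve the grouping within matchlist, so flattening it does not make sense. Here is a solution without it, using a double lapply loop as you started with: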
out <- lapply(mylist, function(this) {
mtch <- lapply(matchlist, intersect, this)
wh <- which.max(lengths(mtch))
if (length(wh)) mtch[[wh]] else character(0)
})
str(out)
# List of 9
# $ PP : chr "OMITTED"
# $ IN01: chr "OMITTED"
# $ RD1 : chr [1:3] "NOT REACHED" "INVALID" "OMITTED"
# $ LOS : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
# $ COM : chr(0)
# $ VR1 : chr "OMITTED"
# $ INF : chr [1:2] "NOT APPLICABLE" "OMITTED"
# $ IST : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
# $ CMP : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
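It always returns the vector with the most matches, but if there is (somehow) more than one, I think it will preserve the natural order and return the first of those longest matches. (The question is: "does which.max preserve natural order?" I think it does, but I have not verified it.)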
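The tie behaviour is easy to check directly: which.max is documented to return the index of the first maximum. A minimal sketch with a made-up score vector:

```r
scores <- c(1L, 3L, 0L, 3L)

first_max <- which.max(scores)             # 2: the first of the tied maxima
all_max   <- which(scores == max(scores))  # 2 4: every tied position
all_zero  <- which.max(c(0L, 0L, 0L))      # 1: still one index when nothing matched
empty     <- which.max(integer(0))         # integer(0): only for empty input
```

So the length(wh) test in the code above is only FALSE when matchlist itself is empty; when nothing matches, the first (empty) intersection is returned, which is still character(0).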
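Update
A constraint was added: not only must the values of a matchlist vector be present and in order, there must also be no intervening words. For example, if mylist$RD1 contained "BLAH", as suggested in the comments, it would no longer match matchlist[[5]].
Checking whether one vector is a perfect ordered subset of another is a bit awkward (so this is no code-golf champion) and generally hard to scale, because there is no simple subset test for it. With that caveat, this implementation uses a few nested *apply functions ...
(Note: it was suggested in the comments that $RD1 should return character(0), but it does contain "INVALID", which matches one of the length-one components of matchlist, so it should match that one rather than the longer one.)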
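out <- lapply(mylist, function(this) {
  # positions in `this` where each matchlist vector could start
  ind <- lapply(matchlist, function(a) which(this == a[1]))
  # score: length of the matchlist vector if it occurs as an unbroken run, else 0
  perfectmatches <- mapply(function(ml, allis, this) {
    length(ml) * any(vapply(allis, function(i) {
      # isTRUE() turns the NA from running past the end of `this` into FALSE
      isTRUE(all(ml == this[i + seq_along(ml) - 1]))
    }, logical(1)))
  }, matchlist, ind, MoreArgs = list(this = this))
  if (any(perfectmatches > 0)) {
    wh <- which.max(perfectmatches)
    return(matchlist[[wh]])
  } else return(character(0))
})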
str(out)
# List of 9
# $ PP : chr "OMITTED"
# $ IN01: chr "OMITTED"
# $ RD1 : chr "INVALID"
# $ LOS : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
# $ COM : chr(0)
# $ VR1 : chr "OMITTED"
# $ INF : chr [1:2] "NOT APPLICABLE" "OMITTED"
# $ IST : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
# $ CMP : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
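The contiguity check can also be pulled out into a small helper, which may be easier to test on its own. The following is only a sketch: contains_run and longest_run are hypothetical names, cands is a shortened stand-in for matchlist, and the position of "BLAH" in rd1_blah is an assumption made for illustration.

```r
# Hypothetical helper: TRUE if `pattern` occurs in `x` as an unbroken, ordered run
contains_run <- function(x, pattern) {
  n <- length(pattern)
  if (n == 0 || n > length(x)) return(FALSE)
  starts <- seq_len(length(x) - n + 1)
  any(vapply(starts, function(s) identical(x[s:(s + n - 1)], pattern), logical(1)))
}

# Hypothetical helper: the longest candidate that occurs as a run (first one on ties)
longest_run <- function(x, candidates) {
  hits <- Filter(function(p) contains_run(x, p), candidates)
  if (length(hits) == 0) return(character(0))
  hits[[which.max(lengths(hits))]]
}

# A few vectors from the question's matchlist
cands <- list("INVALID", c("INVALID", "OMITTED"), "OMITTED",
              c("NOT REACHED", "INVALID", "OMITTED"), c("NOT REACHED", "OMITTED"))

rd1      <- c("YES", "NO", "NOT REACHED", "INVALID", "OMITTED")
rd1_blah <- c("YES", "NO", "NOT REACHED", "INVALID", "BLAH", "OMITTED")  # assumed position
rd2      <- c("YES", "NO", "NOT REACHED", "NOT AN OPTION", "OMITTED")

res_rd1  <- longest_run(rd1, cands)       # "NOT REACHED" "INVALID" "OMITTED"
res_blah <- longest_run(rd1_blah, cands)  # "INVALID"
res_rd2  <- longest_run(rd2, cands)       # "OMITTED"
```

Under this rule RD2 still yields "OMITTED", because the length-one component "OMITTED" is itself an unbroken run. Getting character(0) for RD2, as the question requests, would need an additional rule that the question does not spell out.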