Partial intersection of the elements of vectors in two lists

Asked: 2018-03-14 15:36:14

Tags: r list match intersection

I have a list like this:

mylist <- list(PP = c("PP 1", "OMITTED"),
           IN01 = c("DID NOT PARTICIPATE", "PARTICIPATED", "OMITTED"),                     
           RD1 = c("YES", "NO", "NOT REACHED", "INVALID", "OMITTED"),
           RD2 = c("YES", "NO", "NOT REACHED", "NOT AN OPTION", "OMITTED"),
           LOS = c("LESS THAN 3", "3 TO 100", "100 TO 500", "MORE THAN 500", "LOGICALLY NOT APPLICABLE", "OMITTED"),
           COM = c("BAN", "SBAN", "RAL"), 
           VR1 = c("WITHIN 30", "WITHIN 200", "NOT AVAILABLE", "OMITTED"),                         
           INF = c("A LOT", "SOME", "LITTLE OR NO", "NOT APPLICABLE", "OMITTED"),               
           IST = c("FULL-TIME", "PART-TIME", "FULL STAFFED", "NOT STAFFED", "LOGICALLY NOT APPLICABLE", "OMITTED"),
           CMP = c("ALL", "MOST", "SOME", "NONE", "LOGICALLY NOT APPLICABLE", "OMITTED"))

I have another list like this:

matchlist <- list("INVALID", c("INVALID", "OMITTED OR INVALID"),
c("INVALID", "OMITTED"), "OMITTED", c("NOT REACHED", "INVALID", "OMITTED"),
c("LOGICALLY NOT APPLICABLE", "INVALID", "OMITTED"),
c("LOGICALLY NOT APPLICABLE", "INVALID", "OMITTED OR INVALID"),
c("Not applicable", "Not stated"), c("Not reached", "Not administered/missing by design", "Presented but not answered/invalid"),
c("Not administered/missing by design", "Presented but not answered/invalid"),
"OMITTED OR INVALID",
c("LOGICALLY NOT APPLICABLE", "OMITTED OR INVALID"),
c("NOT REACHED", "OMITTED"),
c("NOT APPLICABLE", "OMITTED"), 
c("LOGICALLY NOT APPLICABLE", "OMITTED"),
c("LOGICALLY NOT APPLICABLE", "NOT REACHED", "OMITTED"),
"NOT EXCLUDED", c("Default", "Not applicable", "Not stated"), c("Valid Skip", "Not Reached", "Not Applicable", "Invalid", "No Response"),
c("Not administered", "Omitted"),
c("NOT REACHED", "INVALID RESPONSE", "OMITTED"),
c("INVALID RESPONSE", "OMITTED"))

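As you can see, some of the vectors in matchlist partially match vectors in mylist. In some cases a vector from matchlist fully matches part of a vector in mylist. For example, the last values of RD1 in mylist match the vector in the fifth component of matchlist, but RD2 does not match it, although there are common values. The values in RD2 in mylist ("NOT REACHED", "NOT AN OPTION", "OMITTED"), taken together in this order, have no match in any of the vectors in matchlist. The same goes for the values of COM in mylist.

What I want to achieve is to compare the elements of each vector in mylist against each vector in matchlist, extract the common values that match the values in matchlist in the same order, and store them in another list. The desired result should look like this: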

$PP
[1] "OMITTED"

$IN01
[1] "OMITTED"

$RD1
[1] "NOT REACHED" "INVALID" "OMITTED"

$RD2
character(0)

$LOS
[1] "LOGICALLY NOT APPLICABLE" "OMITTED"

$COM
character(0)

$VR1
[1] "OMITTED"

$INF
[1] "NOT APPLICABLE" "OMITTED"

$IST
[1] "LOGICALLY NOT APPLICABLE" "OMITTED"

$CMP
[1] "LOGICALLY NOT APPLICABLE" "OMITTED"

What I have tried so far:

Using intersect:

lapply(mylist, function(i) {
  intersect(i, lapply(matchlist, function(i) {i}))
})

It returns only the last value from each vector of matchlist ("OMITTED").

Using match / %in%:

lapply(mylist, function(i) {
  i[which(i %in% matchlist)]
})

This returns the desired result only for RD1 ("INVALID", "OMITTED"); the rest return only the last value ("OMITTED"), except for COM.

Using mapply with intersect:

mapply(intersect, mylist, matchlist)

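This returns a long list with almost everything in it, including combinations that should not be there, plus a warning about unequal lengths.

Can someone help?

4 answers:

Answer 0 (score: 4)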

Here is a simple solution using unlist on matchlist:

lapply(mylist, function(x) x[x %in% unlist(matchlist)])

Output (new list):

$PP
[1] "OMITTED"

$IN01
[1] "OMITTED"

$RD1
[1] "NOT REACHED" "INVALID"     "OMITTED"    

$LOS
[1] "LOGICALLY NOT APPLICABLE" "OMITTED"                 

$COM
character(0)

$VR1
[1] "OMITTED"

$INF
[1] "NOT APPLICABLE" "OMITTED"       

$IST
[1] "LOGICALLY NOT APPLICABLE" "OMITTED"                 

$CMP
[1] "LOGICALLY NOT APPLICABLE" "OMITTED"                 
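One thing to keep in mind with this approach: unlist flattens matchlist, so the grouping and order inside each matchlist vector no longer play a role. The sketch below uses a reduced, illustrative matchlist (small_matchlist is a made-up name, not part of the question) to show what that means for RD2, which the question expects to be empty:

```r
# RD2 exactly as in the question
rd2 <- c("YES", "NO", "NOT REACHED", "NOT AN OPTION", "OMITTED")

# Reduced stand-in for matchlist (illustrative only)
small_matchlist <- list(c("NOT REACHED", "INVALID", "OMITTED"), "INVALID")

# After flattening, any value found anywhere in matchlist is kept
rd2_flat <- rd2[rd2 %in% unlist(small_matchlist)]
rd2_flat
# [1] "NOT REACHED" "OMITTED"
```

So if the "same order, as a group" requirement matters, the flattened version is probably too permissive.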

Answer 1 (score: 3)

Simply writing

lapply(mylist, intersect, unlist(matchlist))

also works.
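A small caveat, sketched on made-up vectors: intersect treats its arguments as sets and drops duplicates, while subsetting with %in% keeps every occurrence. The vectors in the question contain no repeated values, so both forms should give the same result there.

```r
# Illustrative input with a repeated value (not from the question)
x <- c("OMITTED", "INVALID", "OMITTED")
table_values <- c("OMITTED", "NOT REACHED")

kept_by_in <- x[x %in% table_values]             # keeps both "OMITTED"
kept_by_intersect <- intersect(x, table_values)  # set semantics, one "OMITTED"

kept_by_in
# [1] "OMITTED" "OMITTED"
kept_by_intersect
# [1] "OMITTED"
```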

Answer 2 (score: 2)

lapply(mylist, function(i) {
  # keep x only if it occurs as a whole word somewhere in matchlist
  unlist(sapply(i,function(x){if(any(grepl(paste0("\\b",x,"\\b"),matchlist))){x}}))
})

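I added "\\b" before and after the string because "NO" could otherwise lead to finding "NOT". Using grepl is certainly not the best way, as the other answers show :)

Answer 3 (score: 1)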
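To make the word-boundary point concrete, here is a minimal sketch on toy strings (the variable names are only for illustration):

```r
# Without boundaries, "NO" is found inside "NOT REACHED"
plain_hit <- grepl("NO", "NOT REACHED")

# With \\b on both sides, "NO" must stand as a whole word
bounded_hit <- grepl("\\bNO\\b", "NOT REACHED")
bounded_ok  <- grepl("\\bNO\\b", c("NO", "YES, NO"))

plain_hit
# [1] TRUE
bounded_hit
# [1] FALSE
bounded_ok
# [1] TRUE TRUE
```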

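There are some really simple/nice answers, but they all seem to rely on unlist. I'm assuming you need to preserve the grouping within matchlist, so flattening it does not make sense. Here is a solution without it, using a double lapply loop as you started with: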

out <- lapply(mylist, function(this) {
  mtch <- lapply(matchlist, intersect, this)
  wh <- which.max(lengths(mtch))
  if (length(wh)) mtch[[wh]] else character(0)
})
str(out)
# List of 9
#  $ PP  : chr "OMITTED"
#  $ IN01: chr "OMITTED"
#  $ RD1 : chr [1:3] "NOT REACHED" "INVALID" "OMITTED"
#  $ LOS : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
#  $ COM : chr(0) 
#  $ VR1 : chr "OMITTED"
#  $ INF : chr [1:2] "NOT APPLICABLE" "OMITTED"
#  $ IST : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
#  $ CMP : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"

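It always returns the one vector with the most matches, but if there is (somehow) more than one, I think it will preserve natural order and return the first of said longest matches. (The question being: does which.max preserve natural order? I think it does, but have not verified it.)

Update

A constraint was added: not only are the presence and order of the matchlist vector required, but there must also be no intervening words. For example, if (as suggested in the comments) mylist$RD1 contained "BLAH", then it would no longer match matchlist[[5]].

Checking whether one vector is a perfect ordered subset of another is a bit problematic (hence not a code-golf champion), and is generally hard to scale since we have no simple subset determination. With that caveat, this implementation does some nested *apply functions ...

(Note: the comments suggested that $RD1 should return character(0), but it does match "INVALID", one of the length-1 components of matchlist, so it should match, just not the longer one.)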
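As far as I can tell from the documentation of which.max, it returns the index of the first maximum, so ties resolve to the earliest component. A quick sketch with made-up lengths:

```r
# Illustrative match lengths with a tie for the maximum
lens <- c(1L, 3L, 3L, 2L)

first_longest <- which.max(lens)       # index of the first maximum
all_longest <- which(lens == max(lens)) # every index that ties

first_longest
# [1] 2
all_longest
# [1] 2 3

# For empty input which.max returns integer(0),
# which is what the length(wh) guard in the code above protects against
empty_case <- which.max(integer(0))
```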

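out <- lapply(mylist, function(this) {
  # positions in `this` where each matchlist vector could start
  ind <- lapply(matchlist, function(a) which(this == a[1]))
  # score: length of the matchlist vector if it occurs as an unbroken run, else 0
  perfectmatches <- mapply(function(ml, allis, this) {
    length(ml) * any(vapply(allis, function(i) {
      # running past the end of `this` gives NA, which counts as no match
      isTRUE(all(ml == this[ i + seq_along(ml) - 1 ]))
    }, logical(1)))
  }, matchlist, ind, MoreArgs = list(this=this))
  if (any(perfectmatches > 0)) {
    wh <- which.max(perfectmatches)
    return(matchlist[[wh]])
  } else return(character(0))
})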
str(out)
# List of 9
#  $ PP  : chr "OMITTED"
#  $ IN01: chr "OMITTED"
#  $ RD1 : chr "INVALID"
#  $ LOS : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
#  $ COM : chr(0) 
#  $ VR1 : chr "OMITTED"
#  $ INF : chr [1:2] "NOT APPLICABLE" "OMITTED"
#  $ IST : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
#  $ CMP : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
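If the nested *apply calls are hard to follow, the core test (does one vector occur inside another as an unbroken, ordered run?) can be pulled out into a small helper. This is only a sketch: contains_run is a hypothetical name, and the position of "BLAH" in rd1_blah is my assumption about the example from the comments.

```r
# Hypothetical helper: TRUE if `needle` appears in `haystack`
# as a contiguous run in the same order
contains_run <- function(haystack, needle) {
  n <- length(needle)
  if (n == 0L || n > length(haystack)) return(FALSE)
  # candidate start positions that leave room for the whole needle
  starts <- which(haystack == needle[1])
  starts <- starts[starts + n - 1L <= length(haystack)]
  any(vapply(starts, function(s) {
    all(haystack[s:(s + n - 1L)] == needle)
  }, logical(1)))
}

rd1 <- c("YES", "NO", "NOT REACHED", "INVALID", "OMITTED")
# Assumed placement of the intervening word
rd1_blah <- c("YES", "NO", "NOT REACHED", "INVALID", "BLAH", "OMITTED")
target <- c("NOT REACHED", "INVALID", "OMITTED")

contains_run(rd1, target)         # TRUE
contains_run(rd1_blah, target)    # FALSE, "BLAH" breaks the run
contains_run(rd1_blah, "INVALID") # TRUE, a length-1 vector still matches
```

With such a helper, the scoring step could be written as the length of each matchlist vector multiplied by contains_run(this, ml), which should behave the same way as the inline version above.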