Matching similar string vectors and returning the non-matching elements

Time: 2014-11-15 11:13:51

Tags: r string

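I have 2 datasets containing vectors of similar strings (product titles). The only difference between the strings in the two datasets is the absence or presence of special characters.

My problem is to match the corresponding strings and return the elements that do not match (in every case these should be special characters). A single string can contain several non-matching special characters.

For example, I have 2 texts: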

Text 1: Analog Science Fiction and Fact February 1995
Text 2: Analog Science Fiction and Fact, February 1995

Is there an R function that returns only the non-matching elements?

This is how I tried to solve the problem:

S.vector <- strsplit(Acceptdata['Text.1'][1,],' ')
S.vector
# [[1]]
# [1] "Analog"   "Science"  "Fiction"  "and"      "Fact"     "February" "1995"    

F.vector <- strsplit(Acceptdata['Text.2'][1,],' ')
F.vector
# [[1]]
# [1] "Analog"   "Science"  "Fiction"  "and"      "Fact,"    "February" "1995"

l.S.vector <- tolower(S.vector)
l.F.vector <- tolower(F.vector)
grep("l.S.vector",l.F.vector,invert=T,value=T)
# [1] "c(\"analog\", \"science\", \"fiction\", \"and\", \"fact,\", \"february\", \"1995\")"

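Any help is much appreciated.

When I try to run the algorithm on the whole dataset (~500 vectors), it throws an error because is.character(a) is not TRUE.

The procedure I followed: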

common <- function(a,b) { 
  for (i in seq_along(a)) 
    for (j in seq_along(b)) 
    i2 <- strsplit(tolower(i),'') 
    j2 <- strsplit(tolower(j),'') 
    if(length(i2) < length(j2)) { 
      i2[(length(i2)+1):length(j2)] <- ' ' 
    } else if(length(i2) > length(j2)) { 
      b2[(length(b2)+1):length(a2)] <- ' ' 
    } 
    LCS(i2,j2) 
} 

z <- common(a,b) 
Error: is.character(a) is not TRUE

Where am I going wrong?

1 answer:

Answer 0 (score: 1)

I'm not entirely clear on your expected output, but I think this will help you reach your goal. It uses the LCS function from the qualV package.

library("qualV")
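common <- function(a,b) {
    # split each string into a vector of single characters
    a2 <- strsplit(a,'')[[1]]
    b2 <- strsplit(b,'')[[1]]
    # pad the shorter vector with spaces so both have the same length
    if(length(a2) < length(b2)) {
        a2[(length(a2)+1):length(b2)] <- ' '
    } else if(length(a2) > length(b2)) {
        b2[(length(b2)+1):length(a2)] <- ' '
    }
    # longest common subsequence of the two character vectors
    LCS(a2,b2)
}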

Here is an example using two strings:

a <- 'Analog Science Fiction and Fact February 1995'
b <- 'Analog Science Fiction and Fact, February 1995'
z <- common(a,b)
paste0(z$LCS, collapse = '') # common string
# [1] "Analog Science Fiction and Fact February 1995"
z$b[which(!seq(1,max(z$vb)) %in% z$vb)] # non-matching elements in `b`
# [1] ","
z$a[which(!seq(1,max(z$va)) %in% z$va)] # non-matching elements in `a`
# character(0)
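
If word-level granularity is enough, the attempt from the question can probably be made to work with base R alone. The following is only a sketch: `token_diff` is a hypothetical helper name, not part of the original post or of any package. The key points are that `strsplit()` returns a list, so `[[1]]` is needed before calling `tolower()`, and that `setdiff()` compares the two token vectors directly.

```r
# Hypothetical helper (not from the original post):
# tokens of `y` that do not occur in `x`, ignoring case.
token_diff <- function(x, y) {
  # strsplit() returns a list; [[1]] extracts the character vector,
  # so tolower() works element-wise instead of deparsing the list.
  x_tokens <- tolower(strsplit(x, " ", fixed = TRUE)[[1]])
  y_tokens <- tolower(strsplit(y, " ", fixed = TRUE)[[1]])
  setdiff(y_tokens, x_tokens)
}

text1 <- "Analog Science Fiction and Fact February 1995"
text2 <- "Analog Science Fiction and Fact, February 1995"

extra_tokens <- token_diff(text1, text2)
extra_tokens
# [1] "fact,"

# Drop letters and digits to keep only the special characters.
special_chars <- gsub("[[:alnum:]]", "", extra_tokens)
special_chars
# [1] ","
```

This reports whole tokens that differ, so it cannot say where inside a token the special character sits; the character-level LCS approach above is finer grained.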

Here is an example using two strings that differ more:

a <- 'Analog! SCIENCE Fiction and Fact Feb. 1995'
b <- 'Analog Science Fiction & Fact (February 1995)'
z <- common(a,b)
paste0(z$LCS, collapse = '') # common string
# [1] "Analog S Fiction  Fact Feb 1995"
z$b[which(!seq(1,max(z$vb)) %in% z$vb)] # non-matching elements in `b`
# [1] "c" "i" "e" "n" "c" "e" "&" "(" "r" "u" "a" "r" "y"
z$a[which(!seq(1,max(z$va)) %in% z$va)] # non-matching elements in `a`
# [1] "!" "C" "I" "E" "N" "C" "E" "a" "n" "d" "."
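
Regarding the follow-up error: the loop in the question appears to pass the loop indices `i` and `j` to `strsplit()` rather than the strings themselves, and it hands the resulting lists (without `[[1]]`) to `LCS()`, which seems to be what trips the `is.character(a)` check. A column stored as a factor would fail the same check, so `as.character()` may be needed first. Below is a minimal sketch of a pairwise run over two vectors. It swaps in base R's `utils::adist()` (edit-distance alignment) instead of `qualV::LCS`, so it is a different technique from the one above; `diff_chars` and the sample vectors are hypothetical names and inputs made up for illustration.

```r
# Hypothetical helper (not from the original answer): character-level
# differences between two single strings, based on utils::adist().
diff_chars <- function(a, b) {
  stopifnot(is.character(a), length(a) == 1L,
            is.character(b), length(b) == 1L)
  # The "trafos" attribute encodes the alignment, one letter per step:
  # M = match, I = insertion (character only in b),
  # D = deletion (character only in a), S = substitution.
  trafos <- attr(utils::adist(a, b, counts = TRUE), "trafos")[1, 1]
  ops <- strsplit(trafos, "")[[1]]
  a_chars <- strsplit(a, "")[[1]]
  b_chars <- strsplit(b, "")[[1]]
  a_ops <- ops[ops != "I"]  # exactly one step per character of a
  b_ops <- ops[ops != "D"]  # exactly one step per character of b
  list(only_in_a = a_chars[a_ops != "M"],
       only_in_b = b_chars[b_ops != "M"])
}

# Illustrative input vectors (made up; the real data would be two columns).
titles1 <- c("Analog Science Fiction and Fact February 1995",
             "Fact February 1995")
titles2 <- c("Analog Science Fiction and Fact, February 1995",
             "Fact (February 1995)")

# Map() calls diff_chars() once per pair of corresponding elements.
results <- Map(diff_chars, as.character(titles1), as.character(titles2))

results[[1]]$only_in_b
# [1] ","
results[[2]]$only_in_b
# [1] "(" ")"
```

`Map()` keeps one result per pair, so each element of `results` can be inspected or collapsed with `paste0()` as needed. When several alignments are equally cheap, `adist()` reports only one of them, so strings that differ heavily may yield a different split than the LCS output shown above.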