Find the length of the overlap between strings

Time: 2018-02-09 07:54:51

Tags: r string bioinformatics overlap dna-sequence

Do you know of any ready-made method to get both the overlap of two strings and its length? It has to work in R only, possibly with something from stringr. Unfortunately, I looked here without success.

Examples:

str1 <- 'ABCDE'
str2 <- 'CDEFG'

str_overlap(str1, str2)
'CDE'

str_overlap_len(str1, str2)
3

If there are two possible solutions, always choose the one that actually overlaps:

str1 <- 'ATTAGACCTG'
str2 <- 'CCTGCCGGAA'

str_overlap(str1, str2)
'CCTG'

str_overlap_len(str1, str2)
4

I would also be interested in small self-made functions, such as this one.
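One way to read the examples above is that the overlap is the longest suffix of str1 that is also a prefix of str2. Below is a minimal base R sketch under that assumption. The names str_overlap and str_overlap_len are taken from the calls in the question; they are hypothetical helpers, not existing stringr functions.

```r
# Hypothetical helper: longest suffix of str1 that is also a prefix of str2.
str_overlap <- function(str1, str2) {
  max_len <- min(nchar(str1), nchar(str2))
  # Try the longest candidate first, then shrink one character at a time.
  for (k in rev(seq_len(max_len))) {
    suffix <- substring(str1, nchar(str1) - k + 1L)
    prefix <- substr(str2, 1L, k)
    if (suffix == prefix) return(suffix)
  }
  ""  # no overlap found
}

# Hypothetical helper: length of that overlap.
str_overlap_len <- function(str1, str2) nchar(str_overlap(str1, str2))

str_overlap("ABCDE", "CDEFG")                 # "CDE"
str_overlap_len("ATTAGACCTG", "CCTGCCGGAA")   # 4
```

Note that this reading is directional: swapping the two arguments compares the end of the second string with the start of the first, which can give a different result.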

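2 answers:

Answer 0: (score: 2)

It seems to me that you (the OP) are not overly concerned about performance, but are more interested in a potential approach for solving this without a ready-made function. So here is an example I came up with that computes the longest common substring. I must note that it only returns the first longest common substring found, even if there are several of the same length. That is something you can modify to fit your needs. And please don't expect this to be super fast; it won't be.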

foo <- function(str1, str2, ignore.case = FALSE, verbose = FALSE) {

  if(ignore.case) {
    str1 <- tolower(str1)
    str2 <- tolower(str2)
  }

  if(nchar(str1) < nchar(str2)) {
    x <- str2
    str2 <- str1
    str1 <- x
  }

  x <- strsplit(str2, "")[[1L]]
  n <- length(x)
  s <- sequence(seq_len(n))
  s <- split(s, cumsum(s == 1L))
  s <- rep(list(s), n)

  for(i in seq_along(s)) {
    s[[i]] <- lapply(s[[i]], function(x) {
      x <- x + (i-1L)
      x[x <= n]
    })
    s[[i]] <- unique(s[[i]])
  }

  s <- unlist(s, recursive = FALSE)
  s <- unique(s[order(-lengths(s))])

  i <- 1L
  len_s <- length(s)
  while(i <= len_s) {  # "<=" so that the last (shortest) candidate is checked too
    lcs <- paste(x[s[[i]]], collapse = "")
    if(verbose) cat("now checking:", lcs, "\n")
    check <- grepl(lcs, str1, fixed = TRUE)
    if(check) {
      cat("the (first) longest common substring is:", lcs, "of length", nchar(lcs), "\n")
      break
    } else {
      i <- i + 1L 
    }
  }
}

str1 <- 'ABCDE'
str2 <- 'CDEFG'
foo(str1, str2)
# the (first) longest common substring is: CDE of length 3 

str1 <- 'ATTAGACCTG'
str2 <- 'CCTGCCGGAA'
foo(str1, str2)
# the (first) longest common substring is: CCTG of length 4

str1 <- 'foobarandfoo'
str2 <- 'barand'
foo(str1, str2)
# the (first) longest common substring is: barand of length 6 

str1 <- 'EFGABCDE'
str2 <- 'ABCDECDE'
foo(str1, str2)
# the (first) longest common substring is: ABCDE of length 5 


set.seed(2018)
str1 <- paste(sample(c(LETTERS, letters), 500, TRUE), collapse = "")
str2 <- paste(sample(c(LETTERS, letters), 250, TRUE), collapse = "")

foo(str1, str2, ignore.case = TRUE)
# the (first) longest common substring is: oba of length 3 

foo(str1, str2, ignore.case = FALSE)
# the (first) longest common substring is: Vh of length 2 
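Since foo() only reports the first longest common substring, here is a minimal sketch of one way to collect all of them. It swaps in a different technique, a dynamic-programming table of common-run lengths, instead of enumerating candidate substrings. The name lcs_all is hypothetical and not part of the answer above.

```r
# Hypothetical helper: all longest common substrings of two strings,
# found with a dynamic-programming table of common-run lengths.
lcs_all <- function(str1, str2) {
  a <- strsplit(str1, "")[[1L]]
  b <- strsplit(str2, "")[[1L]]
  if (length(a) == 0L || length(b) == 0L) return(character(0))
  # len[i + 1, j + 1] = length of the common run ending at a[i] and b[j]
  len <- matrix(0L, length(a) + 1L, length(b) + 1L)
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      if (a[i] == b[j]) len[i + 1L, j + 1L] <- len[i, j] + 1L
    }
  }
  best <- max(len)
  if (best == 0L) return(character(0))
  # Positions in str1 (table rows minus the offset) where a longest run ends.
  ends <- which(len == best, arr.ind = TRUE)[, "row"] - 1L
  unique(substring(str1, ends - best + 1L, ends))
}

lcs_all("ABCDE", "CDEFG")         # "CDE"
sort(lcs_all("ABXCD", "CDYAB"))   # "AB" "CD"
```

The table needs nchar(str1) * nchar(str2) cells, so this is a sketch for short strings like the ones in the question rather than for whole genomes.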

Answer 1: (score: 1)

Hope this helps:

library(stringr)

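larsub <- function(x) {
  a <- x[1]
  b <- x[2]
  chars <- strsplit(a, "")[[1]]
  prior <- character(0)  # matches found for the previous length n
  # try subsequences of a (characters in order, not necessarily adjacent)
  # of increasing length n
  for (n in seq_along(chars)) {
    sb <- unique(combn(chars, n, FUN = paste, collapse = ""))
    found <- unlist(str_extract_all(b, sb))
    # the first length with no match in b ends the search
    if (length(found) == 0) {
      return(prior)
    }
    prior <- found
  }
  # every length matched, so return the matches for the full length of a
  prior
}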

c1<-larsub(c('ABCD','BCDE'))
c2<-larsub(c('ABDFD','BCDE'))
c3<-larsub(c('CDEWQ','DEQ'))
c4<-larsub(c('BNEOYJBELMGY','BELM'))
print(c1)
print(c2)
print(c3)
print(c4)

Output:

> print(c1)
[1] "BCD"
> print(c2)
[1] "B" "D"
> print(c3)
[1] "DEQ"
> print(c4)
[1] "BELM"

Disclaimer: the logic is borrowed from the LCS answer posted here by @Rick Scriven: longest common substring in R finding non-contiguous matches between the two strings
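The "non-contiguous" part matters for the c3 result. A small base R check, using only the strings from the example above, shows that "DEQ" is built from characters of 'CDEWQ' in order, but is not a contiguous substring of it:

```r
chars <- strsplit("CDEWQ", "")[[1]]
# All 3-character combinations of "CDEWQ", with characters kept in order.
subseqs <- combn(chars, 3, FUN = paste, collapse = "")

"DEQ" %in% subseqs                    # TRUE: D, E, Q appear in this order
grepl("DEQ", "CDEWQ", fixed = TRUE)   # FALSE: they are not adjacent
```

So if a strictly contiguous overlap is needed, the result of larsub should be checked against both input strings.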