Do you know of any ready-made method to get both the overlap of two strings and its length? It should be in R only, though, perhaps something from stringr? I looked here, unfortunately without success.

str1 <- 'ABCDE'
str2 <- 'CDEFG'
str_overlap(str1, str2)
'CDE'
str_overlap_len(str1, str2)
3

Other examples:

/// if there are two solutions, always choose the overlap
str1 <- 'ATTAGACCTG'
str2 <- 'CCTGCCGGAA'
str_overlap(str1, str2)
'CCTG'
str_overlap_len(str1, str2)
4
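I would also be interested in small homemade functions, such as this one?

Answer 0 (score: 2)

It seems to me that you (the OP) are less concerned about performance than about a potential approach for solving this without a ready-made function. So here is an example I came up with that computes the longest common substring. Note that it only returns the first longest common substring found, even if there are several of the same length. That is something you can modify to suit your needs. Please don't expect it to be super fast; it won't be.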
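foo <- function(str1, str2, ignore.case = FALSE, verbose = FALSE) {
  if (ignore.case) {
    str1 <- tolower(str1)
    str2 <- tolower(str2)
  }
  # make sure str2 is the shorter string; the candidates are built from it
  if (nchar(str1) < nchar(str2)) {
    x <- str2
    str2 <- str1
    str1 <- x
  }
  x <- strsplit(str2, "")[[1L]]
  n <- length(x)
  # index vectors 1, 1:2, ..., 1:n (every substring starting at position 1)
  s <- sequence(seq_len(n))
  s <- split(s, cumsum(s == 1L))
  s <- rep(list(s), n)
  # shift the index vectors to each start position i, dropping indices beyond n
  for (i in seq_along(s)) {
    s[[i]] <- lapply(s[[i]], function(x) {
      x <- x + (i - 1L)
      x[x <= n]
    })
    s[[i]] <- unique(s[[i]])
  }
  s <- unlist(s, recursive = FALSE)
  # longest candidates first, so the first hit is a longest common substring
  s <- unique(s[order(-lengths(s))])
  i <- 1L
  len_s <- length(s)
  # '<=' rather than '<', otherwise the last candidate is never checked
  while (i <= len_s) {
    lcs <- paste(x[s[[i]]], collapse = "")
    if (verbose) cat("now checking:", lcs, "\n")
    check <- grepl(lcs, str1, fixed = TRUE)
    if (check) {
      cat("the (first) longest common substring is:", lcs, "of length", nchar(lcs), "\n")
      # also hand the substring back so it can be used programmatically
      return(invisible(lcs))
    } else {
      i <- i + 1L
    }
  }
  # no common substring at all
  invisible(NULL)
}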
str1 <- 'ABCDE'
str2 <- 'CDEFG'
foo(str1, str2)
# the (first) longest common substring is: CDE of length 3
str1 <- 'ATTAGACCTG'
str2 <- 'CCTGCCGGAA'
foo(str1, str2)
# the (first) longest common substring is: CCTG of length 4
str1 <- 'foobarandfoo'
str2 <- 'barand'
foo(str1, str2)
# the (first) longest common substring is: barand of length 6
str1 <- 'EFGABCDE'
str2 <- 'ABCDECDE'
foo(str1, str2)
# the (first) longest common substring is: ABCDE of length 5
set.seed(2018)
str1 <- paste(sample(c(LETTERS, letters), 500, TRUE), collapse = "")
str2 <- paste(sample(c(LETTERS, letters), 250, TRUE), collapse = "")
foo(str1, str2, ignore.case = TRUE)
# the (first) longest common substring is: oba of length 3
foo(str1, str2, ignore.case = FALSE)
# the (first) longest common substring is: Vh of length 2
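
As mentioned, foo only reports the first longest common substring. If all ties are wanted, one possible sketch is the textbook dynamic-programming approach below. The name lcs_all and its list return value are my own assumptions for illustration, not an existing function from base R or stringr.

```r
# Hypothetical helper (not from any package): returns every longest common
# substring of a and b together with the shared length.
lcs_all <- function(a, b) {
  x <- strsplit(a, "")[[1L]]
  y <- strsplit(b, "")[[1L]]
  best <- 0L                        # length of the longest match so far
  ends <- integer(0)                # end positions (in a) of the longest matches
  prev <- integer(length(y) + 1L)   # DP row for the previous character of a
  for (i in seq_along(x)) {
    cur <- integer(length(y) + 1L)
    for (j in seq_along(y)) {
      if (x[i] == y[j]) {
        # extend the common run that ended at x[i - 1] and y[j - 1]
        cur[j + 1L] <- prev[j] + 1L
        if (cur[j + 1L] > best) {
          best <- cur[j + 1L]
          ends <- i
        } else if (cur[j + 1L] == best) {
          ends <- c(ends, i)
        }
      }
    }
    prev <- cur
  }
  list(substrings = unique(substring(a, ends - best + 1L, ends)),
       length = best)
}

res <- lcs_all("ABXCD", "ABYCD")
res$substrings  # "AB" "CD"
res$length      # 2
```

This needs on the order of nchar(a) * nchar(b) comparisons, so it should scale better than checking every candidate with grepl, although it is still plain interpreted R.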

Answer 1 (score: 1)

Hope this helps:
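library(stringr)

larsub <- function(x) {
  a <- x[1]
  b <- x[2]
  prior <- character(0)  # matches found for the previous length
  # For each length n, build every order-preserving combination of n
  # characters of a (they need not be adjacent in a) and look for them in b.
  for (n in seq_len(nchar(a))) {
    sb <- unique(combn(strsplit(a, "")[[1]], n, FUN = paste, collapse = ""))
    hits <- unlist(str_extract_all(b, sb))
    if (length(hits) == 0) {
      # nothing of length n matches, so the previous length was the longest
      return(prior)
    }
    prior <- hits
  }
  # every length up to nchar(a) matched, so return the last set of matches
  prior
}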
c1<-larsub(c('ABCD','BCDE'))
c2<-larsub(c('ABDFD','BCDE'))
c3<-larsub(c('CDEWQ','DEQ'))
c4<-larsub(c('BNEOYJBELMGY','BELM'))
print(c1)
print(c2)
print(c3)
print(c4)
Output:
> print(c1)
[1] "BCD"
> print(c2)
[1] "B" "D"
> print(c3)
[1] "DEQ"
> print(c4)
[1] "BELM"
Disclaimer: the logic is borrowed from the LCS answer here: "longest common substring in R finding non-contiguous matches between the two strings", posted by @Rick Scriven.
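
One thing to be aware of: combn() picks characters of a that need not be adjacent, which is why 'CDEWQ' and 'DEQ' give "DEQ" above even though "DEQ" is not a contiguous piece of 'CDEWQ'. If only contiguous overlaps are wanted, a minimal sketch in base R could look like the following. The name larsub_contig and its two-argument interface are assumptions made for this example.

```r
# Hypothetical variant (name and interface are assumptions): only contiguous
# substrings of a are considered, using base R instead of combn().
larsub_contig <- function(a, b) {
  n <- nchar(a)
  # try the longest length first, so the first hit is a longest match
  for (len in rev(seq_len(n))) {
    starts <- seq_len(n - len + 1L)
    cand <- unique(substring(a, starts, starts + len - 1L))
    # fixed = TRUE compares literally, so regex metacharacters are harmless
    hits <- cand[vapply(cand, grepl, logical(1), x = b, fixed = TRUE)]
    if (length(hits) > 0L) return(hits)
  }
  character(0)  # no common character at all
}

larsub_contig("CDEWQ", "DEQ")   # "DE"
larsub_contig("ABDFD", "BCDE")  # "B" "D"
```

Because it enumerates only the contiguous pieces of a, the number of candidates grows quadratically with nchar(a) rather than combinatorially as with combn().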