Example:
s <- "aaabaabaa"
p <- "aa"
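I want to return 4 rather than 3 (that is, count the occurrences of "aa" in the initial "aaa" as 2, not 1).
Is there a solution for this? Or is there some way to count overlapping matches in R?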
Answer 0 (score: 7)
I believe
find_overlaps <- function(p,s) {
    gg <- gregexpr(paste0("(?=",p,")"),s,perl=TRUE)[[1]]
    if (length(gg)==1 && gg==-1) 0 else length(gg)
}
find_overlaps("aa","aaabaabaa") ## 4
find_overlaps("not_there","aaabaabaa") ## 0
find_overlaps("aa","aaaaaaaa") ## 7
does what you want, which is stated more precisely as "find the number of overlapping substrings within a string".
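To see why the lookahead matters, the following minimal sketch compares a plain match with the zero-width `(?=...)` form on the string from the question. The variable names `plain` and `ahead` are illustrative only and do not come from the answer.

```r
s <- "aaabaabaa"
p <- "aa"

# Plain match: each match consumes its characters, so the search resumes
# after the matched text and overlapping occurrences are skipped.
plain <- gregexpr(p, s, perl = TRUE)[[1]]

# Lookahead: the match is zero-width, so the search advances one character
# at a time and every start position is tested.
ahead <- gregexpr(paste0("(?=", p, ")"), s, perl = TRUE)[[1]]

as.integer(plain)  # 1 5 8   (3 non-overlapping matches)
as.integer(ahead)  # 1 2 5 8 (4 overlapping matches)
```

Note that `p` is pasted into a regular expression, so a pattern containing regex metacharacters such as `.` or `*` would be interpreted as a regex rather than as literal text.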
This is a minor variation on "Finding the indexes of multiple/overlapping matching substrings".
Answer 1 (score: 2)
substring may be useful here: use it to extract every pair of consecutive characters.
( ss <- sapply(2:nchar(s), function(i) substring(s, i-1, i)) )
## [1] "aa" "aa" "ab" "ba" "aa" "ab" "ba" "aa"
sum(ss %in% p)
## [1] 4
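The code above is tied to two-character patterns. A possible generalisation to a pattern of any length, sketched under the assumption that the pattern should be compared literally (no regex), slides a window of `nchar(p)` characters over the string. The function name `count_windows` is hypothetical.

```r
count_windows <- function(s, p) {
  n <- nchar(s)
  k <- nchar(p)
  # An empty pattern, or one longer than the string, cannot match.
  if (k == 0 || k > n) return(0L)
  # substring() is vectorised over its start and stop arguments.
  starts <- seq_len(n - k + 1)
  sum(substring(s, starts, starts + k - 1) == p)
}

count_windows("aaabaabaa", "aa")         # 4
count_windows("aaaaaaaa", "aa")          # 7
count_windows("aaabaabaa", "not_there")  # 0
```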
Answer 2 (score: 1)
I needed an answer to a related, more general question. Here is how I generalized Ben Bolker's solution:
my.data <- read.table(text = '
my.string my.cov
1.2... 1
.21111 2
..2122 3
...211 2
112111 4
212222 1
', header = TRUE, stringsAsFactors = FALSE)
desired.result.2ch <- read.table(text = '
my.string my.cov n.11 n.12 n.21 n.22
1.2... 1 0 0 0 0
.21111 2 3 0 1 0
..2122 3 0 1 1 1
...211 2 1 0 1 0
112111 4 3 1 1 0
212222 1 0 1 1 3
', header = TRUE, stringsAsFactors = FALSE)
desired.result.3ch <- read.table(text = '
my.string my.cov n.111 n.112 n.121 n.122 n.222 n.221 n.212 n.211
1.2... 1 0 0 0 0 0 0 0 0
.21111 2 2 0 0 0 0 0 0 1
..2122 3 0 0 0 1 0 0 1 0
...211 2 0 0 0 0 0 0 0 1
112111 4 1 1 1 0 0 0 0 1
212222 1 0 0 0 1 2 0 1 0
', header = TRUE, stringsAsFactors = FALSE)
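# my.cov is not used in the count; it is accepted only so that apply()
# can pass the columns of each row positionally.
find_overlaps <- function(s, my.cov, p) {
    gg <- gregexpr(paste0("(?=",p,")"),s,perl=TRUE)[[1]]
    if (length(gg)==1 && gg==-1) 0 else length(gg)
}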
p <- c('11', '12', '21', '22', '111', '112', '121', '122', '222', '221', '212', '211')
my.output <- matrix(0, ncol = (nrow(my.data)+1), nrow = length(p))
for(i in seq(1,length(p))) {
    my.data$p <- p[i]
    my.output[i,1] <- p[i]
    my.output[i,(2:(nrow(my.data)+1))] <- apply(my.data, 1, function(x) find_overlaps(x[1], x[2], x[3]))
}
my.output
desired.result.2ch
desired.result.3ch
pre.final.output <- matrix(t(my.output[,2:7]), ncol=length(p), nrow=nrow(my.data))
final.output <- data.frame(my.data[,1:2], t(apply(pre.final.output, 1, as.numeric)))
colnames(final.output) <- c(colnames(my.data[,1:2]), paste0('x', p))
final.output
# my.string my.cov x11 x12 x21 x22 x111 x112 x121 x122 x222 x221 x212 x211
#1 1.2... 1 0 0 0 0 0 0 0 0 0 0 0 0
#2 .21111 2 3 0 1 0 2 0 0 0 0 0 0 1
#3 ..2122 3 0 1 1 1 0 0 0 1 0 0 1 0
#4 ...211 2 1 0 1 0 0 0 0 0 0 0 0 1
#5 112111 4 3 1 1 0 1 1 1 0 0 0 0 1
#6 212222 1 0 1 1 3 0 0 0 1 2 0 1 0
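Because `gregexpr` is vectorised over its text argument, the loop and the matrix reshaping can possibly be avoided. The sketch below is one alternative, not the answerer's method; the names `count_overlaps`, `strings`, and `patterns` are hypothetical, and only the two-character patterns are shown.

```r
strings  <- c("1.2...", ".21111", "..2122", "...211", "112111", "212222")
patterns <- c("11", "12", "21", "22")

# Count overlapping occurrences of one pattern p in every element of s.
count_overlaps <- function(s, p) {
  m <- gregexpr(paste0("(?=", p, ")"), s, perl = TRUE)
  # gregexpr returns -1 for "no match", so count only positive positions.
  vapply(m, function(g) sum(as.integer(g) > 0L), integer(1))
}

# One column per pattern, one row per string.
counts <- sapply(patterns, count_overlaps, s = strings)
counts
```

The columns of `counts` should correspond to the `n.11`, `n.12`, `n.21`, and `n.22` columns of `desired.result.2ch`.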
Answer 3 (score: 0)
A tidy and, I think, more readable solution is
library(tidyverse)
PatternCount <- function(text, pattern) {
    # Generate all sliding substrings
    map(seq_len(nchar(text) - nchar(pattern) + 1),
        function(x) str_sub(text, x, x + nchar(pattern) - 1)) %>%
        # Test them against the pattern
        map_lgl(function(x) x == pattern) %>%
        # Count the number of matches
        sum
}
PatternCount("aaabaabaa", "aa")
# 4