I have the following strings:
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
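I want to cut each string off as soon as the number of occurrences of A, G, and N reaches a certain value, say 3. In that case, the result should be: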
some_function(strings)
c("ABBSDGN", "AABSDG", "AGN", "GGG")
I tried using stringi, stringr, and regular expressions, but I couldn't figure it out.
Answer 0: (score: 9)
Here is an option using strsplit:
sapply(strsplit(strings, ""), function(x)
  paste(x[1:which.max(cumsum(x %in% c("A", "G", "N")) == 3)], collapse = ""))
#[1] "ABBSDGN" "AABSDG" "AGN" "GGG"
Or in the tidyverse:
library(tidyverse)
map_chr(str_split(strings, ""),
        ~str_c(.x[1:which.max(cumsum(.x %in% c("A", "G", "N")) == 3)], collapse = ""))
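One edge case worth noting: when a string contains fewer than three of the target characters, the comparison is FALSE everywhere and which.max returns 1, so the code above would silently return the first character. The following is a minimal sketch of a guarded variant; the helper name cut_at and the extra test string "AXYZ" are made up for illustration, and returning NA in that case is an assumption about the desired behavior.

```r
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG", "AXYZ")

# Hypothetical helper: cut a single string after the n-th occurrence of any
# character in `chars`, or return NA if there are fewer than n occurrences.
cut_at <- function(s, chars = c("A", "G", "N"), n = 3) {
  x <- strsplit(s, "")[[1]]
  hits <- cumsum(x %in% chars) == n   # TRUE from the n-th hit until the next hit
  if (!any(hits)) return(NA_character_)
  paste(x[seq_len(which.max(hits))], collapse = "")
}

res <- vapply(strings, cut_at, character(1), USE.NAMES = FALSE)
res
```

For the four original strings this gives the same result as the answer; only the added "AXYZ" string differs, yielding NA rather than "A".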
Answer 1: (score: 9)
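You can accomplish the task with a simple call to str_extract from the stringr package:
library(stringr)
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
str_extract(strings, '([^AGN]*[AGN]){3}')
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
The [^AGN]*[AGN] portion of the regex pattern says to look for zero or more consecutive characters that are not A, G, or N, followed by one instance of A, G, or N. The additional wrapping in parentheses and braces, as in ([^AGN]*[AGN]){3}, means to look for that pattern three times in a row. You can change the number of occurrences of A, G, and N to look for by changing the integer in the curly braces:
str_extract(strings, '([^AGN]*[AGN]){4}')
# [1] "ABBSDGNHN" NA "AGNA" "GGGDSRTYHG"
There are a couple of ways to accomplish the task using base R functions. One is to use regexpr followed by regmatches:
m <- regexpr('([^AGN]*[AGN]){3}', strings)
regmatches(strings, m)
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
Alternatively, you can use sub.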
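The code for the sub variant is not shown in this answer. The following is a sketch of how it could look, reusing the same pattern; the exact call is an assumption, not necessarily what the author wrote. The idea is to capture the prefix that ends at the third A, G, or N in group 1 and replace the whole string with that group.

```r
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")

# Outer group 1 captures everything up to and including the third A/G/N;
# the trailing .* consumes the rest, so the replacement keeps only the prefix.
res_sub <- sub('^(([^AGN]*[AGN]){3}).*$', '\\1', strings)
res_sub

# Caveat: unlike str_extract, sub returns a string unchanged when the
# pattern does not match (here: fewer than four A/G/N in the second string).
res_sub4 <- sub('^(([^AGN]*[AGN]){4}).*$', '\\1', strings)
res_sub4
```

Because of that caveat, the sub approach fits best when every string is known to contain enough matches.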
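Answer 2: (score: 6)
Use gregexpr to identify the positions of the pattern, then extract the n-th position (3) and use substr to take the substring from 1 to that n-th position.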
nChars <- 3
pattern <- "A|G|N"
# Using sapply to iterate over strings vector
sapply(strings, function(x) substr(x, 1, gregexpr(pattern, x)[[1]][nChars]))
PS:
If a string doesn't have 3 matches, this generates NA, so you just need to use na.omit on the final result.
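To make the NA behavior concrete, here is a small sketch of the same gregexpr idea wrapped in a helper and followed by na.omit. The helper name cut_at_nth and the extra string "XYZA" are hypothetical, added only to trigger the NA case; the character class "[AGN]" is used as an equivalent of "A|G|N".

```r
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG", "XYZA")
n_chars <- 3
pattern <- "[AGN]"

# Hypothetical helper: positions of all matches, then cut at the n-th one.
cut_at_nth <- function(x, pattern, n) {
  pos <- gregexpr(pattern, x)[[1]]
  substr(x, 1, pos[n])   # pos[n] is NA when there are fewer than n matches
}

res <- vapply(strings, cut_at_nth, character(1),
              pattern = pattern, n = n_chars, USE.NAMES = FALSE)
res

# Drop the strings that did not reach n matches
res_clean <- as.character(na.omit(res))
res_clean
```

USE.NAMES = FALSE keeps the result from being named by the input strings, which sapply does by default for character input.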
Answer 3: (score: 2)
This is just a version of Maurits Evers' neat solution without strsplit.
sapply(strings,
       function(x) {
         raw <- rawToChar(charToRaw(x), multiple = TRUE)
         idx <- which.max(cumsum(raw %in% c("A", "G", "N")) == 3)
         paste(raw[1:idx], collapse = "")
       })
## ABBSDGNHNGA AABSDGDRY AGNAFG GGGDSRTYHG
## "ABBSDGN" "AABSDG" "AGN" "GGG"
Or, slightly different, without strsplit and paste:
test <- charToRaw("AGN")
sapply(strings,
       function(x) {
         raw <- charToRaw(x)
         idx <- which.max(cumsum(raw %in% test) == 3)
         rawToChar(raw[1:idx])
       })
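A point to keep in mind with the raw-based versions is that charToRaw works on bytes rather than characters. For plain ASCII input like the strings here the two coincide, but they can differ for non-ASCII text, so results on such input are worth checking. A tiny illustration, using a made-up string with one accented character:

```r
x <- "\u00e9AGN"                 # "é" followed by "AGN"; illustrative input only

n_chars <- nchar(x)              # number of characters
n_bytes <- length(charToRaw(x))  # number of bytes in the UTF-8 representation

c(chars = n_chars, bytes = n_bytes)
```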
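Answer 4: (score: 0)
Interesting problem. I created a function (see below) that solves your problem. It assumes that your strings contain only letters and no special characters.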
reduce_strings = function(str, chars, cnt){
  # Replacing chars in str with "!"
  chars = paste0(chars, collapse = "")
  replacement = paste0(rep("!", nchar(chars)), collapse = "")
  str_alias = chartr(chars, replacement, str)
  # Obtain indices with ! for each string
  idx = stringr::str_locate_all(pattern = '!', str_alias)
  # Reduce each string in str
  reduce = function(i) substr(str[i], start = 1, stop = idx[[i]][cnt, 1])
  result = vapply(seq_along(str), reduce, "character")
  return(result)
}
# Example call
str = c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
chars = c("A", "G", "N") # Characters that are counted
cnt = 3 # Count of the characters, at which the strings are cut off
reduce_strings(str, chars, cnt) # "ABBSDGN" "AABSDG" "AGN" "GGG"
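Note that the function above indexes row cnt of each match matrix, so it would presumably stop with a subscript error for a string that has fewer than cnt matches. As a sketch of an alternative with the same interface, the following base R helper builds a regular expression from chars and cnt and returns NA for such strings. The name cut_after_n is hypothetical, and like the original it assumes chars contains only plain letters (no regex metacharacters).

```r
# Hypothetical helper: keep each string up to the n-th occurrence of any
# character in `chars`; strings with fewer than n occurrences become NA.
cut_after_n <- function(x, chars, n) {
  set <- paste(chars, collapse = "")
  # e.g. "^([^AGN]*[AGN]){3}" for chars = c("A", "G", "N"), n = 3
  pattern <- paste0("^([^", set, "]*[", set, "]){", n, "}")
  m <- regexpr(pattern, x)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)   # regmatches returns matched elements only
  out
}

str <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
chars <- c("A", "G", "N")

res3 <- cut_after_n(str, chars, 3)
res4 <- cut_after_n(str, chars, 4)
res3
res4
```

With cnt set to 4, the second string has only three matching characters and therefore comes back as NA instead of raising an error.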