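Basically, what the title says. Given a string, I need to extract everything from it except a leading number followed by a space. So, given this string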
"420 species of grass"
I would like to get
"species of grass"
However, given a string that does not start with a number, like this
"The clock says it is 420"
or a string in which the number is not at the start, such as
"It is 420 already"
I want to get the same string back, with the number preserved
"The clock says it is 420"
"It is 420 already"
Matching the leading digits followed by a space works fine:
> library(stringr)
> str_extract_all("420 species of grass", "^\\d+(?=\\s)")
[[1]]
[1] "420"
> str_extract_all("The clock says it is 420", "^\\d+(?=\\s)")
[[1]]
character(0)
> str_extract_all("It is 420 already", "^\\d+(?=\\s)")
[[1]]
character(0)
However, when I try to match everything except the leading digits followed by a space, it does not work:
> str_extract_all("420 species of grass", "[^(^\\d+(?=\\s))]+")
[[1]]
[1] "species" "of" "grass"
> str_extract_all("The clock says it is 420", "[^(^\\d+(?=\\s))]+")
[[1]]
[1] "The" "clock" "says" "it" "is"
> str_extract_all("It is 420 already", "[^(^\\d+(?=\\s))]+")
[[1]]
[1] "It" "is" "already"
This regex seems to match anything that is not a digit or a space.
How can I fix this?
Answer 0 (score: 2)
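I think @Douglas's answer is more concise. However, I suspect your real case is more complicated, and you may want to check ?regexpr to determine where a particular pattern starts.
An approach using a for loop looks like this: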
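strings <- list("420 species of grass",
                "The clock says it is 420",
                "It is 420 already")
# Strip a leading run of digits plus the space after it;
# strings without such a prefix are returned unchanged.
extract <- function(x) {
  y <- vector("list", length(x))
  for (i in seq_along(x)) {
    # regexpr() returns the start of the first match (-1 if there is none);
    # its "match.length" attribute is the number of characters matched.
    m <- regexpr("^[0-9]+ ", x[[i]])
    if (m[[1]] == -1) {
      y[[i]] <- x[[i]]
    } else {
      y[[i]] <- substr(x[[i]], attr(m, "match.length") + 1, nchar(x[[i]]))
    }
  }
  y
}
> extract(strings)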
[[1]]
[1] "species of grass"
[[2]]
[1] "The clock says it is 420"
[[3]]
[1] "It is 420 already"
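Since regexpr is vectorised, the same idea can probably be written without an explicit loop. The following is only a sketch under the assumption that the input is a character vector rather than a list; the names strings, skip and result are illustrative.

```r
# Illustrative input, mirroring the strings used in the question.
strings <- c("420 species of grass",
             "The clock says it is 420",
             "It is 420 already")

# regexpr() returns one match position per element, and -1 where the
# pattern (leading digits followed by one space) is absent.
m <- regexpr("^[0-9]+ ", strings)

# "match.length" is also -1 for non-matches; pmax() turns that into 0,
# so nothing is skipped for those elements.
skip <- pmax(attr(m, "match.length"), 0)

# Drop the first `skip` characters of each string.
result <- substring(strings, skip + 1)
print(result)
```

This keeps the position-based logic of the loop but lets R handle all elements at once.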
Answer 1 (score: 1)
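A simple approach is to use this regex to match the digits at the very start of the string together with the whitespace that follows them,
^\d+\s+
and replace the match with an empty string.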
sub("^\\d+\\s+", "", "420 species of grass")
sub("^\\d+\\s+", "", "The clock says it is 420")
sub("^\\d+\\s+", "", "It is 420 already")
Prints
[1] "species of grass"
[1] "The clock says it is 420"
[1] "It is 420 already"
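Because sub is vectorised over its input, the three calls can also be collapsed into one. The sketch below adds two extra inputs ("420grass" and "420"), which are not from the question and are only there to illustrate that digits without trailing whitespace are left alone.

```r
strings <- c("420 species of grass",
             "The clock says it is 420",
             "It is 420 already",
             "420grass",   # digits not followed by whitespace: unchanged
             "420")        # digits only: unchanged

# One call handles the whole vector; elements without a match are
# returned as they are.
cleaned <- sub("^\\d+\\s+", "", strings)
print(cleaned)
```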
Another way to achieve the same thing by matching: you can use the following regex and capture the contents of group 1,
^(?:\d+\s+)?(.*)$
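Also, metacharacters placed inside a character set lose their special meaning. In [^(^\\d+(?=\\s))]+ the parentheses, +, ?, = and the inner ^ are treated as literal characters, so the positive lookahead is not a lookahead at all, while \d and \s still stand for digits and whitespace. The whole pattern is therefore a negated character class that matches runs of characters that are not digits, whitespace, or one of ( ) ^ + ? =, which is why your regex does not do what you intended.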
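To check that the bracketed pattern is just a negated character class, one can compare it with an explicitly written class. This is a sketch in base R with perl = TRUE (the question uses stringr, so that is an assumption made here to avoid a dependency); the test string x and the name explicit are made up for illustration.

```r
# Test string containing the characters that appear inside the brackets.
x <- "a+b = (420) ^ c?"

original <- "[^(^\\d+(?=\\s))]+"   # pattern from the question
explicit <- "[^()^+?=\\d\\s]+"     # the same class, written out plainly

from_original <- regmatches(x, gregexpr(original, x, perl = TRUE))[[1]]
from_explicit <- regmatches(x, gregexpr(explicit, x, perl = TRUE))[[1]]

print(from_original)
print(identical(from_original, from_explicit))
```

Both patterns return only "a", "b" and "c": every digit, whitespace character and literal ( ) ^ + ? = acts as a separator.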
Edit:
Although the solution using sub is better, if you want a match-based solution in R, you need to use str_match instead of str_extract_all, and access the group 1 contents with [,2]:
library(stringr)
print(str_match("420 species of grass", "^(?:\\d+\\s+)?(.*)$")[,2])
print(str_match("The clock says it is 420", "^(?:\\d+\\s+)?(.*)$")[,2])
print(str_match("It is 420 already", "^(?:\\d+\\s+)?(.*)$")[,2])
Prints
[1] "species of grass"
[1] "The clock says it is 420"
[1] "It is 420 already"
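If stringr is not available, a roughly equivalent match-based version can be sketched in base R with regexec and regmatches. The perl = TRUE argument is needed because (?: ... ) is PCRE syntax; the variable names are illustrative.

```r
strings <- c("420 species of grass",
             "The clock says it is 420",
             "It is 420 already")

pattern <- "^(?:\\d+\\s+)?(.*)$"

# regexec() records the full match and each capture group;
# regmatches() turns that into one character vector per input string.
matches <- regmatches(strings, regexec(pattern, strings, perl = TRUE))

# Element 1 is the full match, element 2 is capture group 1.
group1 <- vapply(matches, function(m) m[2], character(1))
print(group1)
```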
Answer 2 (score: 1)
I think the simplest approach is to remove the number, rather than extract the desired pattern:
library(stringr)
strings <- c("420 species of grass", "The clock says it is 420", "It is 420 already")
str_remove(strings, pattern = "^\\d+\\s")
[1] "species of grass" "The clock says it is 420" "It is 420 already"
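One detail worth noting is that "^\\d+\\s" removes exactly one whitespace character, whereas "^\\d+\\s+" removes all of them. The sketch below uses base R's sub in place of str_remove (an assumption made to keep it dependency-free) and a made-up input with three spaces.

```r
x <- "420   species of grass"   # three spaces after the number

one_space  <- sub("^\\d+\\s", "", x)    # strips the digits and one space
all_spaces <- sub("^\\d+\\s+", "", x)   # strips the digits and every space

print(one_space)
print(all_spaces)
```

Which variant is preferable depends on whether extra leading whitespace should survive.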