I have strings like this:
x <- c("DATE TODAY d. 011 + e. 0030 + r. 1061", "Now or never d. 003 + e. 011 + g. 021", "Long term is long time (e. 104 to d. 10110)", "Time is everything (1012) - /1072, 091A/")
Desired output:
d <- c("DATE TODAY","Now or never","Long term is long time","Time is everything")
After an hour of searching SO I still can't get this to work. Any help is appreciated.
Answer 0 (score: 4)
This bit uses stringr to extract anything containing two or more consecutive letters:
> library(stringr)
> unlist(lapply(str_extract_all(x,"[a-zA-Z][a-zA-Z]+"),paste,collapse=" "))
[1] "DATE TODAY" "Now or never"
[3] "Long term is long time to" "Time is everything"
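I hope the missing "to" in your desired output is a mistake on your part. It is a perfectly good word, and you said you wanted to extract words.
Answer 1 (score: 1)
The pattern is not very clear. However, based on the example shown, here are two ways to get the expected result.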
sub('( .\\.| \\().*', '', x)
#[1] "DATE TODAY" "Now or never" "Long term is long time"
#[4] "Time is everything"
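To see why this first approach works, it may help to look at where the pattern first matches. The sketch below is not part of the answer: it uses regexpr() to locate the first occurrence of either alternative (a space, any character, and a literal dot; or a space followed by an opening parenthesis) and keeps everything before it. The names cut_pattern and cut_at and the extra input "No codes here" are made up for illustration.

```r
x <- c("DATE TODAY d. 011 + e. 0030 + r. 1061",
       "Now or never d. 003 + e. 011 + g. 021",
       "Long term is long time (e. 104 to d. 10110)",
       "Time is everything (1012) - /1072, 091A/")

# Same pattern as in the answer; the name cut_pattern is illustrative.
cut_pattern <- "( .\\.| \\().*"

# Position of the first match in each string, i.e. where the cut happens.
cut_at <- as.integer(regexpr(cut_pattern, x))
cut_at
# [1] 11 13 23 19

# Keeping everything before that position gives the same result as sub().
substr(x, 1, cut_at - 1)

# A string containing neither marker is returned unchanged by sub().
sub(cut_pattern, "", "No codes here")
# [1] "No codes here"
```

This relies on every input containing either a single-character code followed by a dot or an opening parenthesis right after the title; strings without either marker pass through sub() untouched.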
Or, with a Perl-compatible pattern:
pat1 <- '(?<=[0-9] )[A-Za-z]+(*SKIP)(*F)|[A-Za-z]{2,}'
sapply(regmatches(x,gregexpr(pat1, x, perl=TRUE)), paste, collapse=" ")
#[1] "DATE TODAY" "Now or never" "Long term is long time"
#[4] "Time is everything"
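The first branch of pat1 matches a run of letters that directly follows a digit and a space, then discards it with (*SKIP)(*F), so only the second branch ever produces a match. A small sketch, using only the third input string, shows which word this removes compared with the plain two-or-more-letters pattern. The variable names s, plain_words, and skip_words are illustrative and not from the answer.

```r
s <- "Long term is long time (e. 104 to d. 10110)"

plain_pat <- "[A-Za-z]{2,}"
skip_pat  <- "(?<=[0-9] )[A-Za-z]+(*SKIP)(*F)|[A-Za-z]{2,}"

# All runs of two or more letters.
plain_words <- regmatches(s, gregexpr(plain_pat, s, perl = TRUE))[[1]]

# Same, but letter runs directly preceded by "digit + space" are skipped.
skip_words <- regmatches(s, gregexpr(skip_pat, s, perl = TRUE))[[1]]

# The only word dropped is the one that follows "104 ".
setdiff(plain_words, skip_words)
# [1] "to"
```

The skip rule is tied to the digit-and-space context, so a word following a number anywhere else in a string would be dropped as well.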
If "to" is a valid word and its absence from the expected result is a typo:
pat1 <- '[A-Za-z]{2,}'
sapply(regmatches(x,gregexpr(pat1, x, perl=TRUE)), paste, collapse=" ")
#[1] "DATE TODAY" "Now or never"
#[3] "Long term is long time to" "Time is everything"
Answer 2 (score: 1)
I agree with the others that "to" is a valid word. Here is a stringi approach:
library(stringi)
stri_replace_all_regex(x, "\\s?[A-Za-z]?[+[:punct:]0-9]", "")
# [1] "DATE TODAY" "Now or never"
# [3] "Long term is long time to" "Time is everything"
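For readers without stringi installed, the same pattern appears to work with base R's gsub() on this ASCII input. This is only a sketch under that assumption: stringi uses the ICU regex engine while gsub() defaults to POSIX-style classes, so the two may disagree on other punctuation or on non-ASCII text. The name strip_pattern is illustrative.

```r
x <- c("DATE TODAY d. 011 + e. 0030 + r. 1061",
       "Now or never d. 003 + e. 011 + g. 021",
       "Long term is long time (e. 104 to d. 10110)",
       "Time is everything (1012) - /1072, 091A/")

# Optional space, optional single letter, then one punctuation mark, "+" or digit.
strip_pattern <- "\\s?[A-Za-z]?[+[:punct:]0-9]"

# Remove every such fragment; words of two or more letters survive.
cleaned <- gsub(strip_pattern, "", x)
cleaned
```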