Extracting only words using R

Time: 2014-12-30 11:46:10

Tags: r

I have strings like this:

x <- c("DATE TODAY d. 011 + e. 0030 + r. 1061","Now or never d. 003 + e. 011 + g. 021", "Long term is long time (e. 104 to d. 10110)","Time is everything (1012) - /1072, 091A/")

Desired output:

d <- c("DATE TODAY","Now or never","Long term is long time","Time is everything")

After an hour of searching SO, I still could not get this to work. Any help is appreciated.

3 answers:

Answer 0: (score: 4)

This uses stringr to extract every run of two or more letters and then pastes the pieces of each string back together:

> library(stringr)
> unlist(lapply(str_extract_all(x,"[a-zA-Z][a-zA-Z]+"),paste,collapse=" "))
[1] "DATE TODAY"                "Now or never"             
[3] "Long term is long time to" "Time is everything"     

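I am hoping the missing "to" in your desired output is a mistake on your part. It is a perfectly good word, and you said you wanted to extract words.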
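To see why the pattern asks for at least two letters, it may help to compare it with a one-or-more pattern on the first string from the question. This is a minimal sketch that swaps in base R (`regmatches` and `gregexpr`) for stringr; the names `s`, `one_plus`, and `two_plus` are introduced here for illustration only.

```r
s <- "DATE TODAY d. 011 + e. 0030 + r. 1061"  # first string from the question

# One or more letters: the single-letter markers d, e, r are picked up too
one_plus <- regmatches(s, gregexpr("[A-Za-z]+", s))[[1]]

# Two or more letters: only the actual words remain
two_plus <- regmatches(s, gregexpr("[A-Za-z]{2,}", s))[[1]]

print(one_plus)
print(two_plus)
```

The two-letter minimum is what drops the `d.`, `e.`, and `r.` markers, and it is also why a genuine two-letter word such as "to" is kept.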

Answer 1: (score: 1)

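The pattern is not very clear. However, based on the example shown, here are two ways to get the expected result. The first removes everything from the first " x." marker or " (" onward; the second keeps runs of two or more letters while skipping any word that directly follows a digit and a space.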

sub('( .\\.| \\().*', '', x)
#[1] "DATE TODAY"             "Now or never"           "Long term is long time"
#[4] "Time is everything"    

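# Words preceded by "digit + space" (the "to" after "104 ") are matched and then
# discarded via (*SKIP)(*F); otherwise, runs of two or more letters are kept
pat1 <- '(?<=[0-9] )[A-Za-z]+(*SKIP)(*F)|[A-Za-z]{2,}'
sapply(regmatches(x,gregexpr(pat1, x, perl=TRUE)), paste, collapse=" ")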
#[1] "DATE TODAY"             "Now or never"           "Long term is long time"
#[4] "Time is everything"    
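To confirm that both patterns above reproduce the desired output from the question, the results can be compared against `d` with `identical()`. This is a minimal sketch using only base R; the names `res_sub` and `res_skip` are introduced here for illustration.

```r
x <- c("DATE TODAY d. 011 + e. 0030 + r. 1061",
       "Now or never d. 003 + e. 011 + g. 021",
       "Long term is long time (e. 104 to d. 10110)",
       "Time is everything (1012) - /1072, 091A/")
d <- c("DATE TODAY", "Now or never", "Long term is long time", "Time is everything")

# Approach 1: cut everything from the first " <char>." or " (" onward
res_sub <- sub('( .\\.| \\().*', '', x)

# Approach 2: keep runs of 2+ letters, skipping words that follow "digit + space"
pat1 <- '(?<=[0-9] )[A-Za-z]+(*SKIP)(*F)|[A-Za-z]{2,}'
res_skip <- sapply(regmatches(x, gregexpr(pat1, x, perl = TRUE)),
                   paste, collapse = " ")

# Compare each result with the desired output
print(identical(res_sub, d))
print(identical(res_skip, d))
```

Note that the first approach relies on every string containing a " x." or " (" marker after the words; a string without either marker would come back unchanged.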

If "to" is a valid word and its absence from the expected result is a typo:

pat1 <- '[A-Za-z]{2,}'
sapply(regmatches(x,gregexpr(pat1, x, perl=TRUE)), paste, collapse=" ")
#[1] "DATE TODAY"                "Now or never"             
#[3] "Long term is long time to" "Time is everything"  

Answer 2: (score: 1)

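I agree with the others that "to" is a valid word. Here is a stringi approach, which removes each punctuation mark, "+", or digit together with an optional preceding letter and space: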

library(stringi)

stri_replace_all_regex(x, "\\s?[A-Za-z]?[+[:punct:]0-9]", "")
# [1] "DATE TODAY"                "Now or never"             
# [3] "Long term is long time to" "Time is everything"