How to find the longest string in a text using regex in R

Time: 2016-01-09 13:59:38

Tags: regex r

Given a string x, I can count the number of words in it (its length) using gregexpr("[A-Za-z]\\w+", x).

> x<-"\n\n\n\n\n\nMasters Publics\n\n\n\n\n\n\n\n\n\n\n\n\nMasters Universitaires et Prives au Maroc\n\n\n\n\n\n\n\n\\n\n\n\n\nMasters Par Ville\n\n\n\n\n\n\n\n\n\n\n\n\n"
> sapply(gregexpr("[A-Za-z]\\w+", x), function(x) sum(x > 0))
[1] 11

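But how can I use a regular expression in R to get the number of words in the longest contiguous string, that is, one whose words are joined by spaces rather than \n?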

In this example, that would be "Masters Universitaires et Prives au Maroc", whose length is 6.

Thanks in advance.

2 answers:

Answer 0: (score: 2)

I would solve it with

x <- "\n\n\n\n\n\nMasters Publics\n\n\n\n\n\n\n\n\n\n\n\n\nMasters Universitaires et Prives au Maroc\n\n\n\n\n\n\n\n\\n\n\n\n\nMasters Par Ville\n\n\n\n\n\n\n\n\n\n\n\n\n"
max(nchar(gsub("[^ ]+", "", unlist(strsplit(trimws(x), "\n+"))))) + 1

Trim the string, split it into lines, unlist the result, delete every character that is not a space, take the character count of the longest remaining string, and add one (a line whose words are separated by n single spaces has n + 1 words). The [^ ]+ is a regex that matches one or more (due to the + quantifier) characters other than a space ([^...] is a negated character class).
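To make the steps above easier to follow, here is the same pipeline unrolled into intermediate variables. It is only a sketch: the variable names (trimmed, lines, spaces, n_words) are illustrative, the input is a shortened version of the question's string, and it assumes words on a line are separated by exactly one space.

```r
x <- "\n\n\nMasters Publics\n\n\n\nMasters Universitaires et Prives au Maroc\n\n\nMasters Par Ville\n\n"

trimmed <- trimws(x)                         # drop leading/trailing whitespace
lines   <- unlist(strsplit(trimmed, "\n+"))  # split on runs of newlines
spaces  <- gsub("[^ ]+", "", lines)          # keep only the space characters
n_words <- nchar(spaces) + 1                 # words per line = spaces + 1

max(n_words)                # word count of the longest line
lines[which.max(n_words)]   # the longest line itself
```

With this input, max(n_words) is 6 and the selected line is "Masters Universitaires et Prives au Maroc". If words may be separated by several spaces, the space count overestimates the word count.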


Answer 1: (score: 1)

Load the package:

library(stringr)

Create a new data set by splitting the text into phrases, then flag the non-empty ones:

data <- unlist(str_split(x, pattern="\n", n = Inf))
# TRUE for the non-empty pieces (nchar() is vectorized, so lapply() is not needed)
index <- nchar(data) != 0
# extract the maximum length of the phrase

max(sapply(gregexpr("\\W+", data[index]), length) + 1)
[1] 6

# just checking
data[index]
[1] "Masters Publics"                          
[2] "Masters Universitaires et Prives au Maroc"
[3] "\\n"                                      
[4] "Masters Par Ville"
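
One caveat with counting separators: gregexpr() returns -1 when there is no match, so length(...) + 1 reports 2 for a one-word line. A possible variant, sketched below in base R, counts the words directly on each line using the pattern from the question. The helper name count_words and the shortened input are assumptions for illustration, and the pattern only counts words of at least two characters that start with a letter.

```r
x <- "\n\nMasters Publics\n\n\nMasters Universitaires et Prives au Maroc\n\nMasters Par Ville\n"

lines <- unlist(strsplit(x, "\n+"))   # split on runs of newlines
lines <- lines[nzchar(lines)]         # drop empty pieces

# gregexpr() yields -1 when nothing matches, so count only positive positions
count_words <- function(s) sum(gregexpr("[A-Za-z]\\w+", s)[[1]] > 0)
n_words <- vapply(lines, count_words, integer(1), USE.NAMES = FALSE)

max(n_words)                # word count of the longest line
lines[which.max(n_words)]   # the longest line itself
```

For this input the maximum is 6, matching the result above.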