Capitalize the first word of a sentence (regex, gsub, gregexpr)

Time: 2014-04-10 00:31:08

Tags: r

Suppose I have the following text:

txt <- as.character("this is just a test! i'm not sure if this is O.K. or if it will work? who knows. regex is sorta new to me..  There are certain cases that I may not figure out??  sad!  ^_^")

I want to capitalize the first alphabetic character of each sentence.

I came up with this regular expression to match the sentence boundaries: ^|[[:alnum:]]+[[:alnum:]]+[.!?]+[[:space:]]*[[:space:]]+[[:alnum:]]

The call to gregexpr returns:

> gregexpr("^|[[:alnum:]]+[[:alnum:]]+[.!?]+[[:space:]]*[[:space:]]+[[:alnum:]]", txt)
[[1]]
[1]   1  16  65  75 104 156
attr(,"match.length")
[1] 0 7 7 8 7 8
attr(,"useBytes")
[1] TRUE

These are the correct substring indices of the matches.

But how do I use this to actually capitalize the characters I need? I assume I have to strsplit and then...?

2 Answers:

Answer 0: (score: 4)

Your regex doesn't seem to work on your example, so I stole one from this question.

txt <- as.character("this is just a test! i'm not sure if this is O.K. or if it will work? who knows. regex is sorta new to me..  There are certain cases that I may not figure out??  sad!  ^_^")
print(txt)

gsub("([^.!?\\s])([^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?)(?=\\s|$)", "\\U\\1\\E\\2", txt, perl = TRUE, useBytes = FALSE)
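
The part doing the real work in that call is the replacement string: with perl = TRUE, \\U upper-cases whatever follows it in the replacement until \\E (or the end of the replacement). A minimal sketch of that mechanism, using a much simpler pattern and a toy input of my own (both are assumptions for illustration, not the regex above; this version ignores quotes and will also capitalize after abbreviations such as "O.K."):

```r
# Toy input (hypothetical, not the text from the question)
x <- "hello world. how are you? fine!"

# Group 1: start of string, or sentence-ending punctuation plus whitespace.
# Group 2: the lowercase letter that starts the next sentence.
# \\1 is copied back unchanged; \\U upper-cases \\2 (requires perl = TRUE).
out <- gsub("(^|[.!?]\\s+)([a-z])", "\\1\\U\\2", x, perl = TRUE)

print(out)
#> [1] "Hello world. How are you? Fine!"
```

Without perl = TRUE the \\U escape is not interpreted as a case conversion, so the flag is not optional here.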

Answer 1: (score: 1)

Using rex may make this kind of task a bit simpler. This implements the same regex that merlin2011 used.

txt <- as.character("this is just a test! i'm not sure if this is O.K. or if it will work? who knows. regex is sorta new to me..  There are certain cases that I may not figure out??  sad!  ^_^")

library(rex)

re <- rex(
  capture(name = 'first_letter', alnum),
  capture(name = 'sentence',
    any_non_puncts,
    zero_or_more(
      group(
        punct %if_next_isnt% space,
        any_non_puncts
        )
      ),
    maybe(punct)
    )
  )

re_substitutes(txt, re, "\\U\\1\\E\\2", global = TRUE)
#>[1] "This is just a test! I'm not sure if this is O.K. Or if it will work? Who knows. Regex is sorta new to me..  There are certain cases that I may not figure out??  Sad!  ^_^"
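
As for the gregexpr route from the question: the match positions do not need strsplit, because base R's regmatches<- can write modified matches back into the string. A rough sketch under my own assumptions (a simplified pattern and a toy input, neither taken from the question or the answers):

```r
# Toy input (hypothetical)
txt2 <- "this is a test! is it ok? yes."

# Match a lowercase letter at the start of the string, or one that follows
# sentence-ending punctuation and whitespace. The group is non-capturing,
# so each match is just the boundary text plus the letter.
m <- gregexpr("(?:^|[.!?]\\s+)[a-z]", txt2, perl = TRUE)

# toupper() leaves punctuation and spaces alone, so upper-casing the whole
# match only changes the letter. The replacement value is a list with one
# character vector per input string.
regmatches(txt2, m) <- lapply(regmatches(txt2, m), toupper)

print(txt2)
#> [1] "This is a test! Is it ok? Yes."
```

This keeps the find-then-modify shape of the original attempt; the gsub and rex versions above do the same thing in a single substitution.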