假设我有以下文字:
txt <- as.character("this is just a test! i'm not sure if this is O.K. or if it will work? who knows. regex is sorta new to me.. There are certain cases that I may not figure out?? sad! ^_^")
我想把句子的第一个字母字符大写。
我想出了匹配的正则表达式:^|[[:alnum:]]+[[:alnum:]]+[.!?]+[[:space:]]*[[:space:]]+[[:alnum:]]
对gregexpr
的调用会返回:
> gregexpr("^|[[:alnum:]]+[[:alnum:]]+[.!?]+[[:space:]]*[[:space:]]+[[:alnum:]]", txt)
[[1]]
[1] 1 16 65 75 104 156
attr(,"match.length")
[1] 0 7 7 8 7 8
attr(,"useBytes")
[1] TRUE
哪些是匹配的正确子字符串索引。
但是,如何实现这一点以正确地利用我需要的字符呢?我假设我必须strsplit
然后......?
答案 0 :(得分:4)
您的regex
似乎不适用于您的示例,因此我从this question偷了一个。
txt <- as.character("this is just a test! i'm not sure if this is O.K. or if it will work? who knows. regex is sorta new to me.. There are certain cases that I may not figure out?? sad! ^_^")
print(txt)
gsub("([^.!?\\s])([^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?)(?=\\s|$)", "\\U\\1\\E\\2", txt, perl=T, useBytes = F)
答案 1 :(得分:1)
使用rex可能会使这类任务变得更简单一些。这实现了与merlin2011使用的相同的正则表达式。
txt <- as.character("this is just a test! i'm not sure if this is O.K. or if it will work? who knows. regex is sorta new to me.. There are certain cases that I may not figure out?? sad! ^_^")
re <- rex(
capture(name = 'first_letter', alnum),
capture(name = 'sentence',
any_non_puncts,
zero_or_more(
group(
punct %if_next_isnt% space,
any_non_puncts
)
),
maybe(punct)
)
)
re_substitutes(txt, re, "\\U\\1\\E\\2", global = TRUE)
#>[1] "This is just a test! I'm not sure if this is O.K. Or if it will work? Who knows. Regex is sorta new to me.. There are certain cases that I may not figure out?? Sad! ^_^"