I want to understand what is wrong with my use of str_replace_all when transforming this string:
abc <- "Good Product ...but it's darken the skin tone..why...?"
Before running sentence tokenization with quanteda, I want to apply an extra step that converts it into something like this:
abc_new <- "Good Product. But it's darken the skin tone. Why?"
I am using the following regular expression to do this:
str_replace_all(abc,"\\.{2,15}[a-z]{1}", paste(".", toupper(str_extract_all(str_extract_all(abc,"\\.{2,15}[a-z]{1}"),"[a-z]{1}")[[1]])[[1]], collapse = " "))
But this returns: "Good Product . Cut it's darken the skin tone. Chy...?"
Can anyone suggest a solution for this?
Answer 0: (score: 1)
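Your replacement code is very hard to read and understand, given how long and how deeply nested it is.
I would break the complex pattern into smaller, traceable patterns that are easy to debug. You can do that either by assigning intermediate results to temporary variables or by using the pipe operator: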
library(magrittr)
string <- "Good Product ...but it's darken the skin tone..why...?"
string %>%
  gsub("\\.+\\?", "?", .) %>%              # Remove full stops before question marks
  gsub("\\.+", ".", .) %>%                 # Replace every run of dots with a single dot
  gsub(" \\.", ".", .) %>%                 # Remove the space before a dot
  gsub("(\\.)([^ ])", ". \\2", .) %>%      # Add a space between the full stop and the next sentence
  gsub("(\\.) ([[:alpha:]])", ". \\U\\2", ., perl = TRUE)  # Uppercase the first letter after the full stop
# [1] "Good Product. But it's darken the skin tone. Why?"
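The temporary-variable alternative mentioned above could look roughly like the sketch below. It uses base R only, so magrittr is not needed; the names s1 to s5 are made up for illustration. The benefit is that each intermediate result can be printed and checked on its own.

```r
string <- "Good Product ...but it's darken the skin tone..why...?"

s1 <- gsub("\\.+\\?", "?", string)    # drop the dots in front of a question mark
s2 <- gsub("\\.+", ".", s1)           # collapse every run of dots into one dot
s3 <- gsub(" \\.", ".", s2)           # remove the space before a dot
s4 <- gsub("\\.([^ ])", ". \\1", s3)  # add a space after a dot when it is missing
s5 <- gsub("\\. ([[:alpha:]])", ". \\U\\1", s4, perl = TRUE)  # uppercase the next letter

print(s3)
#> [1] "Good Product.but it's darken the skin tone.why?"
print(s5)
#> [1] "Good Product. But it's darken the skin tone. Why?"
```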
Answer 1: (score: 1)
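It looks like you are trying to match a pattern that should be removed while keeping part of what that pattern matched. In a regular expression, you can use () to mark the part of the pattern that you want to reuse in the replacement.
Consider your case: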
abc <- "Good Product ...but it's darken the skin tone..why...?"
step1 <- gsub(" ?\\.+([a-zA-Z])",". \\U\\1",abc,perl=TRUE)
step1
#> [1] "Good Product. But it's darken the skin tone. Why...?"
The matching expression breaks down as:
 ?          #A space followed by ?: optionally match a space (to handle the space after Good Product)
\\.+        #Match at least one period
([a-zA-Z])  #Match one letter and remember it
The replacement pattern:
.     #Insert a period followed by a space
\\U   #Insert an uppercase version...
\\1   #of whatever was matched in the first set of parentheses
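To see the capture group and the case conversion in isolation, here is a small sketch on a made-up string ("one..two" is not from the question). Note that \\U only works with perl = TRUE, and that it uppercases the rest of the replacement until an \\E ends the conversion.

```r
x <- "one..two"

# Backreferences: \\1 and \\2 refer to the first and second parenthesised group
swapped <- gsub("([a-z]+)\\.+([a-z]+)", "\\2 \\1", x)
print(swapped)
#> [1] "two one"

# \\U uppercases everything that follows it in the replacement...
shouting <- gsub("\\.+([a-z])([a-z]*)", ". \\U\\1\\2", x, perl = TRUE)
print(shouting)
#> [1] "one. TWO"

# ...unless \\E switches the conversion off again
capitalised <- gsub("\\.+([a-z])([a-z]*)", ". \\U\\1\\E\\2", x, perl = TRUE)
print(capitalised)
#> [1] "one. Two"
```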
Now, this does not fix the ellipsis that is followed by a question mark. A follow-up match takes care of that.
step2 <- gsub("\\.+([^\\. ])", "\\1", step1)
step2
#> [1] "Good Product. But it's darken the skin tone. Why?"
Here we are matching
\\.+        #at least one period
([^\\. ])   #one character that is not a period or a space and remember it
and replacing it with
\\1         #The thing we remembered
So: two steps, and two fairly generic regular expressions that should extend to other use cases as well.
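If you want to apply both steps to a whole vector of texts before tokenizing, they could be wrapped in a small helper, as in the sketch below. The function name tidy_ellipses and the extra test strings are made up for illustration; the two patterns are the ones from the steps above.

```r
# Hypothetical helper: applies the two gsub() steps to a character vector
tidy_ellipses <- function(x) {
  # Step 1: optional space + dots + letter  ->  ". " + uppercased letter
  x <- gsub(" ?\\.+([a-zA-Z])", ". \\U\\1", x, perl = TRUE)
  # Step 2: dots directly followed by a non-dot, non-space character -> that character
  gsub("\\.+([^\\. ])", "\\1", x)
}

texts <- c(
  "Good Product ...but it's darken the skin tone..why...?",
  "wait..what....really?",
  "ok...?"
)
cleaned <- tidy_ellipses(texts)
print(cleaned)
#> [1] "Good Product. But it's darken the skin tone. Why?"
#> [2] "wait. What. Really?"
#> [3] "ok?"
```

Be aware that both patterns also fire on single periods: step 1 turns "e.g" into "e. G", and step 2 removes a period that is directly followed by a non-space character, so "3.14" becomes "314". If your texts contain abbreviations or decimal numbers, changing \\.+ to \\.{2,} in both patterns restricts the clean-up to runs of two or more dots.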