Question

I want to know what is wrong with my use of str_replace_all when transforming this string:

abc <- "Good Product ...but it's darken the skin tone..why...?"

Before running sentence tokenization with quanteda, I want to do some extra preprocessing to turn it into something like this:

abc_new <- "Good Product. But it's darken the skin tone. Why?"

I am using the following regular expression to do this:

str_replace_all(abc,"\\.{2,15}[a-z]{1}", paste(".", toupper(str_extract_all(str_extract_all(abc,"\\.{2,15}[a-z]{1}"),"[a-z]{1}")[[1]])[[1]], collapse = " "))

But this returns: "Good Product. C it's darken the skin tone. C hy ......?"

Can anyone suggest a solution for this?

Answer 1

The replacement code you provided is really hard to read and understand, given how long and deeply nested it is.
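To see what the nested call actually computes, it can help to evaluate it from the inside out. The sketch below is only an approximation of the original: it uses base R's regmatches(), gregexpr() and regexpr() as stand-ins for str_extract_all(), and it assumes that stringr coerces the inner list result to text much as as.character() does. The variable names are illustrative.

```r
abc <- "Good Product ...but it's darken the skin tone..why...?"
pattern <- "\\.{2,15}[a-z]{1}"

# Inner extraction: every run of dots plus the letter that follows it.
# The result is a list holding the character vector c("...b", "..w").
inner <- regmatches(abc, gregexpr(pattern, abc))

# Feeding that list to a string function turns it into its deparsed text,
# i.e. the literal characters c("...b", "..w").
deparsed <- as.character(inner)

# The first lowercase letter in that text is the "c" of c(...),
# not the "b" or "w" from the sentence.
first_letter <- regmatches(deparsed, regexpr("[a-z]", deparsed))

# The replacement is a single fixed string, built once before any
# substitution happens, so every match receives the same text.
replacement <- paste(".", toupper(first_letter), collapse = " ")
print(replacement)
```

Under these assumptions the replacement is the constant ". C", which would account for the stray "C" in the output: the replacement never sees the letter captured at each individual match.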

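I would try to break the complex pattern down into smaller, traceable ones that are easy to debug. This can be done either by assigning intermediate results to temporary variables or by using the pipe operator: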

library(magrittr)
string <- "Good Product ...but it's darken the skin tone..why...?"
string %>% 
  gsub("\\.+\\?", "?", .) %>%   # Remove full-stops before question marks
  gsub("\\.+", ".", .) %>%      # Replace all multiple dots with a single one
  gsub(" \\.", ".", .) %>%      # Remove space before dots
  gsub("(\\.)([^ ])", ". \\2", .) %>%  # Add a space between the full-stop and the next sentence
  gsub("(\\.) ([[:alpha:]])", ". \\U\\2", ., perl=TRUE) # Replace the first letter after the full-stop with its uppercase form

  # [1] "Good Product. But it's darken the skin tone. Why?"
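The same steps can also be written with temporary variables, the other option mentioned above. This is a sketch in base R only (no magrittr); the names s1 to s5 are illustrative. The advantage is that each intermediate string can be printed and checked on its own.

```r
string <- "Good Product ...but it's darken the skin tone..why...?"

s1 <- gsub("\\.+\\?", "?", string)        # drop dots that precede a question mark
s2 <- gsub("\\.+", ".", s1)               # collapse runs of dots into one
s3 <- gsub(" \\.", ".", s2)               # remove the space before a dot
s4 <- gsub("(\\.)([^ ])", ". \\2", s3)    # add a space after a dot when missing
s5 <- gsub("(\\.) ([[:alpha:]])", ". \\U\\2", s4, perl = TRUE)  # capitalize the next letter

# Inspect every stage to find the step where a pattern misbehaves.
print(c(s1, s2, s3, s4, s5))
```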

Answer 2

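You seem to be trying to match a pattern to remove while reusing part of that match in the result. In a regular expression, you can use parentheses () to mark the part of the pattern that you want to use in the replacement.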

Consider your case:

abc <- "Good Product ...but it's darken the skin tone..why...?"
step1 <- gsub(" ?\\.+([a-zA-Z])",". \\U\\1",abc,perl=TRUE)
step1
#> [1] "Good Product. But it's darken the skin tone. Why...?"

The matched expression breaks down as:

 ?         #Optionally match a space (to handle the space after Good Product)
\\.+       #Match at least one period
([a-zA-Z]) #Match one letter and remember it

The replacement pattern:

.       #Insert a period followed by a space
\\U     #Insert an uppercase version...
\\1     #...of whatever was matched in the first set of parentheses
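One detail about \\U is worth keeping in mind: with perl = TRUE it uppercases the rest of the replacement, not just the next group, until an \\E is reached. In the replacement above this makes no difference, because \\1 is the last item. The sketch below uses a hypothetical second capture group purely to show the difference.

```r
x <- "tone..why"

# \\U stays active for everything that follows it in the replacement.
all_caps <- gsub("\\.+([a-z])([a-z]+)", ". \\U\\1\\2", x, perl = TRUE)

# \\E switches case conversion off again, so only the first letter changes.
first_cap <- gsub("\\.+([a-z])([a-z]+)", ". \\U\\1\\E\\2", x, perl = TRUE)

print(all_caps)   # "tone. WHY"
print(first_cap)  # "tone. Why"
```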

Now, this does not fix the ellipsis that is followed by a question mark. A follow-up match can take care of that.

step2 <- gsub("\\.+([^\\. ])","\\1",step1)
step2
#> [1] "Good Product. But it's darken the skin tone. Why?"

Here we match:

\\.+      #at least one period
([^\\. ]) #one character that is not a period or a space and remember it

and replace it with:

\\1 #The thing we remembered

So: two steps, two fairly general regular expressions, which should extend to other use cases as well.
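To try the two steps on other inputs, they can be wrapped in a small helper. This is a minimal sketch; the function name normalize_dots and the second test sentence are made up for illustration.

```r
# Hypothetical helper combining the two substitutions from this answer.
normalize_dots <- function(x) {
  # Step 1: dots followed by a letter become ". " plus the uppercased letter.
  x <- gsub(" ?\\.+([a-zA-Z])", ". \\U\\1", x, perl = TRUE)
  # Step 2: remaining dots directly before other punctuation are dropped.
  gsub("\\.+([^\\. ])", "\\1", x)
}

# gsub() is vectorized, so several strings can be cleaned in one call.
cleaned <- normalize_dots(c(
  "Good Product ...but it's darken the skin tone..why...?",
  "Nice colour....really...!"
))
print(cleaned)
```

Note that the second step removes any period that is directly followed by a non-space character, so a number such as 3.14 would lose its decimal point. Whether that matters depends on the text being cleaned.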
