Question

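I want to split the elements of a character vector, text, into sentences. Several patterns mark a split ("and/ERT", "/$"). There are also exceptions to these patterns (":/$.", "and/ERT then", "./$. Smiley").

Attempt: match the cases where a split should happen, insert an unusual pattern ("^&*") at that position, then strsplit on that specific pattern.

Problem: I don't know how to handle the exceptions properly. There are explicit cases where the unusual pattern ("^&*") should be removed again and the original text restored before running strsplit.

Code: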

text <- c("This are faulty propositions one and/ERT two ,/$, which I want to split ./$. There are cases where I explicitly want and/ERT some where I don't want to split ./$. For example :/$. when there is an and/ERT then I don't want to split ./$. This is also one case where I dont't want to split ./$. Smiley !/$. Thank you ./$!",
"This are the same faulty propositions one and/ERT two ,/$, which I want to split ./$. There are cases where I explicitly want and/ERT some where I don't want to split ./$. For example :/$. when there is an and/ERT then I don't want to split ./$. This is also one case where I dont't want to split ./$. Smiley !/$. Thank you ./$!",
"Like above the same faulty propositions one and/ERT two ,/$, which I want to split ./$. There are cases where I explicitly want and/ERT some where I don't want to split ./$. For example :/$. when there is an and/ERT then I don't want to split ./$. This is also one case where I dont't want to split ./$. Smiley !/$. Thank you ./$!")

patternSplit <- c("and/ERT", "/\\$") # The class of split cases is much larger than in this example, so it is not possible to address them explicitly.
patternSplit <- paste("(", paste(patternSplit, collapse = "|"), ")", sep = "")

exceptionsSplit <- c("\\:/\\$\\.", "and/ERT then", "\\./\\$\\. Smiley")
exceptionsSplit <- paste("(", paste(exceptionsSplit, collapse = "|"), ")", sep = "")

# If you don't have exceptions, it works here. Unfortunately it splits "*$/*" into "*" and "$/*". Would be convenient to avoid this. See example "ideal" split below.
textsplitted <- strsplit(gsub(patternSplit, "^&*\\1", text), "^&*", fixed = TRUE)

# Ideal split:
> textsplitted
[[1]]
 [1] "This are faulty propositions one and/ERT" 
 [2] "two ,/$," 
 [3] "which I want to split ./$."
 [4] "There are cases where I explicitly want and/ERT" 
 [5] "some where I don't want to split ./$." 
 [6] "For example :/$. when there is an and/ERT then I don't want to split ./$."
 [7] "This is also one case where I dont't want to split ./$. Smiley !/$." 
 [8] "Thank you ./$!"

[[2]]
 [1] "This are the same faulty propositions one and/ERT"
 [2] "two ,/$,"
#...      

# This attempt doesn't work!
text <- gsub(patternSplit, "^&*\\1", text)
# Placeholder only: the replacement should be the original text without '^&*'
text <- gsub(exceptionsSplit, "[original text without '^&*']", text)
textsplitted <- strsplit(text, "^&*", fixed = TRUE)
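
For reference, one way the marker idea might be made to work is sketched below. It is not the asker's code and rests on several assumptions: the marker is inserted after the tag rather than before it, the split pattern is widened to "/$" plus one punctuation character so the whole tag stays together, and the exceptions are given as literal strings rather than regexes. The names x, exceptions and add_marker are illustrative only. Each exception is passed through the same marking step, so the marked form can be found and swapped back for the original.

```r
# Toy input (hypothetical, shorter than the question's data)
x <- "one and/ERT two ./$. For example :/$. an and/ERT then no ./$. Smiley !/$. end"

marker <- "^&*"
# Assumption: tag is "/$" plus one punctuation character
patternSplit <- "(and/ERT|/\\$[[:punct:]])"
# Assumption: exceptions are literal strings, not regexes
exceptions <- c(":/$.", "and/ERT then", "./$. Smiley")

# Put the marker right after every split tag
add_marker <- function(s) gsub(patternSplit, paste0("\\1", marker), s)

marked <- add_marker(x)

# An exception that went through add_marker() looks exactly like its
# occurrence in the marked text, so swap it back for the original
for (e in exceptions) {
  marked <- gsub(add_marker(e), e, marked, fixed = TRUE)
}

# Split on the remaining markers and drop the leading spaces
parts <- trimws(strsplit(marked, marker, fixed = TRUE)[[1]])
print(parts)
```

This keeps the list of split tags and the list of exceptions separate, which may help when the class of split cases is large, at the cost of several passes over the text.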

Answer 1

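I think you can use this expression to get the split you want. Because strsplit consumes the characters it splits on, you have to split on the space after the thing you want to match (rather than on the match itself), which is what you have in the desired output in your OP: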

strsplit( text[[1]] , "(?<=and/ERT)\\s(?!then)|(?<=/\\$[[:punct:]])(?<!:/\\$[[:punct:]])\\s(?!Smiley)"  , perl = TRUE )
#[[1]]
#[1] "This are faulty propositions one and/ERT"                                 
#[2] "two ,/$,"                                                                 
#[3] "which I want to split ./$."                                               
#[4] "There are cases where I explicitly want and/ERT"                          
#[5] "some where I don't want to split ./$."                                    
#[6] "For example :/$. when there is an and/ERT then I don't want to split ./$."
#[7] "This is also one case where I dont't want to split ./$. Smiley !/$."      
#[8] "Thank you ./$!"
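
Since strsplit is vectorised, the same expression can be applied to a whole character vector at once. The sketch below builds the expression from two named pieces for readability and runs it on a small made-up vector; the names split_after_and, split_after_tag and docs are illustrative and not part of the original answer.

```r
# The two alternatives of the expression, named for readability
split_after_and <- "(?<=and/ERT)\\s(?!then)"
split_after_tag <- "(?<=/\\$[[:punct:]])(?<!:/\\$[[:punct:]])\\s(?!Smiley)"
rx <- paste(split_after_and, split_after_tag, sep = "|")

# Hypothetical two-element input
docs <- c("a and/ERT b ./$. c",
          "d :/$. e and/ERT then f ./$. Smiley !/$. g")

# One list element per input string
pieces <- strsplit(docs, rx, perl = TRUE)
print(pieces)

# Number of pieces per input string
print(lengths(pieces))
```

The first string is split twice and the second only once, because every other candidate space in the second string falls under one of the exceptions.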

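Explanation

(?<=and/ERT)\\s - split on a space, \\s, that IS preceded, (?<=...), by "and/ERT"
(?!then) - but only if that space is NOT followed, (?!...), by "then"
| - OR operator, chaining the next expression
(?<=/\\$[[:punct:]]) - positive lookbehind assertion for "/$" followed by any punctuation character
(?<!:/\\$[[:punct:]])\\s(?!Smiley) - match a space that is NOT preceded by ":/$" plus a punctuation character (but, per the previous point, IS preceded by "/$" plus a punctuation character) and is NOT followed, (?!...), by "Smiley"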
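
To see what each assertion contributes, the pieces above can be tried one at a time. The sketch below counts how many spaces each partial expression matches in a small made-up string; the string x and the helper count_matches are illustrative only.

```r
# Toy input (hypothetical): two "and/ERT" and three "/$." tags
x <- "a and/ERT b and/ERT then c ./$. d :/$. e ./$. Smiley f"

# Count the matches of a PCRE expression in x
count_matches <- function(rx) {
  lengths(regmatches(x, gregexpr(rx, x, perl = TRUE)))
}

n_and          <- count_matches("(?<=and/ERT)\\s")          # any space after and/ERT
n_and_not_then <- count_matches("(?<=and/ERT)\\s(?!then)")  # ... unless "then" follows
n_tag          <- count_matches("(?<=/\\$[[:punct:]])\\s")  # any space after a tag
n_tag_no_colon <- count_matches("(?<=/\\$[[:punct:]])(?<!:/\\$[[:punct:]])\\s")
n_tag_full     <- count_matches("(?<=/\\$[[:punct:]])(?<!:/\\$[[:punct:]])\\s(?!Smiley)")

print(c(n_and, n_and_not_then, n_tag, n_tag_no_colon, n_tag_full))
```

Each added lookaround removes one candidate split point, so the counts fall from 2 to 1 for the "and/ERT" branch and from 3 to 2 to 1 for the "/$" branch.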

R: splitting text with multiple regex patterns and exceptions
