Regular expression: eliminating all punctuation except certain characters

Time: 2012-11-14 03:10:33

Tags: r strsplit

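I have the following regex, which splits on any whitespace or punctuation character. How can I exclude one or more punctuation characters from [:punct:]? Say I want to exclude the apostrophe and the comma. I know I could explicitly use [all punctuation marks in here] instead of [[:punct:]], but I'm hoping there is an exclusion method.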

X <- "I'm not that good at regex yet, but am getting better!"
strsplit(X, "[[:space:]]|(?=[[:punct:]])", perl=TRUE)

 [1] "I"       "'"       "m"       "not"     "that"    "good"    "at"      "regex"   "yet"    
[10] ","       ""        "but"     "am"      "getting" "better"  "!"

2 answers:

Answer 0 (score: 8)

It's not clear to me what result you want, but you may be able to use a negated character class, as in this answer:

R> strsplit(X, "[[:space:]]|(?=[^,'[:^punct:]])", perl=TRUE)[[1]]
 [1] "I'm"     "not"     "that"    "good"    "at"      "regex"   "yet,"   
 [8] "but"     "am"      "getting" "better"  "!"    
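
The class [^,'[:^punct:]] is a double negative: it matches any character that is not a comma, not an apostrophe, and not a non-punctuation character, which leaves only the remaining punctuation. A minimal sketch for checking this one character at a time; the sample vector `chars` and the name `is_split_point` are made up for illustration, and the [:^punct:] syntax is a PCRE extension, so perl = TRUE is required:

```r
# Sample characters to test (illustrative only, not from the question)
chars <- c("a", " ", ",", "'", "!", "?", ".")

# TRUE only for punctuation other than the comma and the apostrophe,
# i.e. for "!", "?" and "." in this vector
is_split_point <- grepl("[^,'[:^punct:]]", chars, perl = TRUE)
setNames(is_split_point, chars)
```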

Answer 1 (score: 0)

You can restrict a PCRE subpattern directly with a negative lookahead: (?![',]) fails the match if the next character to the right is ' or ,:

[[:space:]]|(?=(?![',])[[:punct:]])
               ^^^^^^^^ 

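See the regex demo.

Details

  • [[:space:]] - any whitespace character
  • | - or
  • (?=(?![',])[[:punct:]]) - a positive lookahead that requires that, immediately to the right of the current position, there is no ' or , and there is one punctuation character (in effect, any punctuation character other than ' and , is required).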
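
To see which characters the restricted subpattern accepts, it can be run on its own, without the surrounding lookahead. This is only a sketch: the variable name `split_points` is made up for illustration, and the input is the sentence from the question.

```r
X <- "I'm not that good at regex yet, but am getting better!"

# Extract every character matched by the restricted punctuation subpattern.
# The apostrophe and the comma are rejected by (?![',]), so only "!" remains.
split_points <- regmatches(X, gregexpr("(?![',])[[:punct:]]", X, perl = TRUE))[[1]]
split_points
```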

See the R online demo:

X <- "I'm not that good at regex yet, but am getting better!"
strsplit(X, "[[:space:]]|(?=(?![',])[[:punct:]])", perl=TRUE)
[[1]]
 [1] "I'm"     "not"     "that"    "good"    "at"      "regex"   "yet,"   
 [8] "but"     "am"      "getting" "better"  "!"
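
If the set of excluded characters changes often, the pattern above could be built from a parameter. The following is a rough sketch rather than part of the original answer: `split_except` and its `keep` argument are hypothetical names, and the helper assumes `keep` contains no characters that are special inside a bracket expression (such as ], ^, - or \).

```r
# Hypothetical helper: split on whitespace and before every punctuation
# character except those listed in `keep`.
# Assumption: `keep` needs no escaping inside a bracket expression.
split_except <- function(x, keep = "',") {
  pattern <- paste0("[[:space:]]|(?=(?![", keep, "])[[:punct:]])")
  strsplit(x, pattern, perl = TRUE)
}

X <- "I'm not that good at regex yet, but am getting better!"
split_except(X)[[1]]
```

With the default `keep`, this builds the same pattern as above, so the call returns the same twelve tokens.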