Remove punctuation except apostrophes and intra-word dashes with gsub in R without accidentally concatenating two words

Time: 2015-10-11 03:32:23

Tags: regex r gsub

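I have been searching Stack Overflow for a solution and experimenting in R (RStudio) for several hours. I know how to remove punctuation with gsub while keeping apostrophes, intra-word dashes, and intra-word & (for AT&T). I am not using the tm package, but I would welcome tips on doing this with tm as well as on the problem below. What I want to know is how to stop gsub, or any other regex tool, from concatenating two words when the punctuation I remove was the only thing separating them. This is the best I have managed so far: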

x <-"Good luck!!!!SPRINT I like good deals. I can't lie brand-new stuff---- excites me got&&&&& to say yo, At&t why? a* dash-- apostrophe's''' I can do all-day. But preventing%%%%concatenating  is a new**$ballgame but----why--- not?"

gsub("(\\w['&-]\\w)|[[:punct:]]", "\\1", x, perl=TRUE) 

#[1] "Good luckSPRINT I like good deals I can't lie brand-new stuff excites me got to say yo At&t why a dash apostrophe's I can do all-day But preventingconcatenating  is a newballgame butwhy not"
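
A minimal illustration of where the gluing comes from, using one fragment of x. Whenever the [[:punct:]] branch matches, group 1 is empty, so the punctuation is replaced by nothing at all. The names fragment and glued exist only for this sketch:

```r
# Fragment taken from x in the question
fragment <- "luck!!!!SPRINT"

# When [[:punct:]] matches, capture group 1 did not take part in the match,
# so "\\1" expands to an empty string and each "!" is deleted outright.
glued <- gsub("(\\w['&-]\\w)|[[:punct:]]", "\\1", fragment, perl = TRUE)
print(glued)
# [1] "luckSPRINT"
```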

Any ideas? The goal is to apply the solution to a data frame column or to a corpus of social media posts.

2 answers:

Answer 0: (score: 2)

You can do it with a single function call, though it leaves leading/trailing whitespace:

gsub("[[:punct:]]* *(\\w+[&'-]\\w+)|[[:punct:]]+ *| {2,}", " \\1", x)
# [1] "Good luck SPRINT I like good deals I can't lie brand-new stuff excites me got to say yo At&t why a dash apostrophe's I can do all-day But preventing concatenating is a new ballgame but why not "
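
As a sketch of how this could be applied to a data frame column (the use case from the question): gsub is vectorised, so the column can be passed in directly, and trimws() can take care of the leading/trailing spaces. The data frame posts and its columns text and clean are hypothetical names, not from the original post:

```r
# Hypothetical data frame; `posts`, `text` and `clean` are illustrative names
posts <- data.frame(
  text = c("Good luck!!!!SPRINT",
           "brand-new stuff---- excites me",
           "At&t why?"),
  stringsAsFactors = FALSE
)

# Pattern copied from the gsub call above
pattern <- "[[:punct:]]* *(\\w+[&'-]\\w+)|[[:punct:]]+ *| {2,}"

# Apply the substitution to the whole column at once, then strip the
# leading/trailing spaces that the " \\1" replacement leaves behind
posts$clean <- trimws(gsub(pattern, " \\1", posts$text))
print(posts$clean)
```

For a corpus stored as a plain character vector, the same two calls should work unchanged on the vector itself.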

If you are able to use the qdapRegex package, you could do the following:

library(qdapRegex)
rm_default(x, pattern = "[^ a-zA-Z&'-]|[&'-]{2,}", replacement = " ")
# [1] "Good luck SPRINT I like good deals I can't lie brand-new stuff excites me got to say yo At&t why a dash apostrophe's I can do all-day But preventing concatenating is a new ballgame but why not"
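
If qdapRegex is not an option, roughly the same result can be had in base R. This is only an approximation, assuming that what rm_default does beyond the substitution amounts to collapsing repeated spaces and trimming the ends; base_clean is a hypothetical helper name. Note that the pattern keeps only letters, spaces and & ' -, so digits are dropped too:

```r
# Base-R approximation of the rm_default call above.
# Assumption: the extra cleanup is "collapse repeated spaces, trim both ends".
base_clean <- function(text) {
  # Replace every disallowed character, and every run of 2+ of & ' -,
  # with a single space
  out <- gsub("[^ a-zA-Z&'-]|[&'-]{2,}", " ", text)
  # Collapse the runs of spaces created above, then trim the ends
  trimws(gsub(" {2,}", " ", out))
}

print(base_clean("stuff---- excites me got&&&&& to say yo, At&t why?"))
```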

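Answer 1: (score: 0)

You can:

  1. Match all whitespace before and after each punctuation mark, and use a single space in the replacement.
  2. Restrict [-'&] to match only after or before a non-word boundary \B, so that a -, ' or & with word characters on both sides is kept.

    Regex: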

    \s*(?:(?:\B[-'&]+|[-'&]+\B|[^-'&[:^punct:]]+)\s*)+
    
    • Note that I'm using a double negative in the character class [^-'&[:^punct:]] to exclude -'& from the POSIX class [:punct:].

    Replacement:

    " "   (1 space)
    

    Code:

    x <-"Good luck!!!!SPRINT I like good deals. I can't lie brand-new stuff---- excites me got&&&&& to say yo, At&t why? a* dash-- apostrophe's''' I can do all-day. But preventing%%%%concatenating  is a new**$ballgame but----why--- not?"
    
    gsub("\\s*(?:(?:\\B[-'&]+|[-'&]+\\B|[^-'&[:^punct:]]+)\\s*)+", " ", x, perl=TRUE)
    
    #[1] "Good luck SPRINT I like good deals I can't lie brand-new stuff excites me got to say yo At&t why a dash apostrophe's I can do all-day But preventing concatenating  is a new ballgame but why not "
    

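
Since this pattern does not touch runs of plain spaces and leaves a trailing space, one possible way to tidy up is to wrap it in a small helper. clean_punct is a hypothetical name, and the two extra cleanup steps are an addition to the answer, not part of it:

```r
# Hypothetical helper wrapping the regex from this answer
clean_punct <- function(text) {
  # Step 1: the answer's substitution (needs perl = TRUE for \B and (?: ... ))
  out <- gsub("\\s*(?:(?:\\B[-'&]+|[-'&]+\\B|[^-'&[:^punct:]]+)\\s*)+",
              " ", text, perl = TRUE)
  # Step 2 (added here): collapse runs of spaces that step 1 does not match
  out <- gsub(" {2,}", " ", out)
  # Step 3 (added here): drop leading/trailing whitespace
  trimws(out)
}

print(clean_punct(c("Good luck!!!!SPRINT",
                    "a  new**$ballgame but----why--- not?")))
```

Because gsub and trimws are both vectorised, the helper can be called on a data frame column or any character vector.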