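I have been searching Stack Overflow for a solution and experimenting in R (RStudio) for hours. I know how to remove punctuation with gsub while keeping apostrophes, intra-word dashes, and intra-word & (as in AT&T). I am not using the tm package, but I would welcome tips on doing this with tm as well as on the following problem. I would like to know how to keep words from being concatenated when gsub, or any other regex routine, removes the punctuation that sat between them. This is the best I have managed so far: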
x <-"Good luck!!!!SPRINT I like good deals. I can't lie brand-new stuff---- excites me got&&&&& to say yo, At&t why? a* dash-- apostrophe's''' I can do all-day. But preventing%%%%concatenating is a new**$ballgame but----why--- not?"
gsub("(\\w['&-]\\w)|[[:punct:]]", "\\1", x, perl=TRUE)
#[1] "Good luckSPRINT I like good deals I can't lie brand-new stuff excites me got to say yo At&t why a dash apostrophe's I can do all-day But preventingconcatenating is a newballgame butwhy not"
Any ideas? The goal is to apply the solution to a data frame column or to a corpus of social media posts.
Answer 0 (score: 2)
You can do this with a single gsub call, although it leaves leading/trailing whitespace:
gsub("[[:punct:]]* *(\\w+[&'-]\\w+)|[[:punct:]]+ *| {2,}", " \\1", x)
# [1] "Good luck SPRINT I like good deals I can't lie brand-new stuff excites me got to say yo At&t why a dash apostrophe's I can do all-day But preventing concatenating is a new ballgame but why not "
If you are able to use the qdapRegex package, you can do the following:
library(qdapRegex)
rm_default(x, pattern = "[^ a-zA-Z&'-]|[&'-]{2,}", replacement = " ")
# [1] "Good luck SPRINT I like good deals I can't lie brand-new stuff excites me got to say yo At&t why a dash apostrophe's I can do all-day But preventing concatenating is a new ballgame but why not"
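Since the question targets a data frame column, the single-call pattern above can be wrapped in a small helper, because gsub is vectorized over its input. This is only a minimal sketch: the helper name `clean_punct`, the data frame `posts`, and its `text` column are hypothetical, and the extra space-collapsing and trimws() steps are an addition to drop the leading/trailing whitespace, not part of the original answer.

```r
# Hypothetical helper: applies the single-call gsub pattern from above,
# then collapses runs of spaces and trims both ends (added step).
clean_punct <- function(text) {
  out <- gsub("[[:punct:]]* *(\\w+[&'-]\\w+)|[[:punct:]]+ *| {2,}", " \\1", text)
  trimws(gsub(" +", " ", out))
}

# Hypothetical data frame standing in for a column of social media posts.
posts <- data.frame(
  id = 1:2,
  text = c("Good luck!!!!SPRINT", "At&t why? brand-new stuff----"),
  stringsAsFactors = FALSE
)

# gsub is vectorized, so the helper works on the whole column at once.
posts$clean <- clean_punct(posts$text)
print(posts$clean)
```

With these two sample strings the cleaned column should read "Good luck SPRINT" and "At&t why brand-new stuff".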
Answer 1 (score: 0)
You can match [-'&] only when it comes right after or right before a non-word boundary, \B.
Regex:
\s*(?:(?:\B[-'&]+|[-'&]+\B|[^-'&[:^punct:]]+)\s*)+
[^-'&[:^punct:]] uses a double negation inside the character class to exclude -'& from the POSIX class [:punct:].
Replacement:
" " (1 space)
Code:
x <-"Good luck!!!!SPRINT I like good deals. I can't lie brand-new stuff---- excites me got&&&&& to say yo, At&t why? a* dash-- apostrophe's''' I can do all-day. But preventing%%%%concatenating is a new**$ballgame but----why--- not?"
gsub("\\s*(?:(?:\\B[-'&]+|[-'&]+\\B|[^-'&[:^punct:]]+)\\s*)+", " ", x, perl=TRUE)
#[1] "Good luck SPRINT I like good deals I can't lie brand-new stuff excites me got to say yo At&t why a dash apostrophe's I can do all-day But preventing concatenating is a new ballgame but why not "
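To see the two building blocks of this regex in isolation, the following sketch checks them separately. The sample inputs and the variable names `chars`, `is_removable`, and `trailing` are made up for illustration; both calls need perl=TRUE, since the negated POSIX class [:^punct:] is PCRE syntax.

```r
# Single characters to classify (illustrative input).
chars <- c("!", "%", "-", "'", "&", "a", " ")

# Double negation: matches punctuation except -, ' and &.
is_removable <- grepl("[^-'&[:^punct:]]", chars, perl = TRUE)
print(is_removable)
# [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

# \B: runs of - ' & are replaced only when one side is not a word boundary,
# so the dash in "brand-new" and the & in "At&t" survive.
trailing <- gsub("\\B[-'&]+|[-'&]+\\B", " ", "brand-new stuff---- At&t", perl = TRUE)
print(trailing)
# [1] "brand-new stuff  At&t"
```

The doubled space in the second result is why the full pattern also wraps everything in \s* and a repeated group: neighbouring whitespace and punctuation runs get folded into a single replacement space.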