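I asked a similar question before, but this one is more specific and needs a different solution from the one I was given there, so I hope it is OK to post it. I need to keep only apostrophes and intra-word dashes in my text (removing all other punctuation). For example, I want to get str2 from str1: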
str1<-"I'm dash before word -word, dash &%$,. in-between word, two before word --word just dashes ------, between words word - word"
str2<-"I'm dash before word word dash in-between word two before word word just dashes between words word word"
My solution so far first removes the dashes that sit between words:
gsub(" - ", " ", str1)
and then keeps only letters, digits and the remaining dashes:
gsub("[^[:alnum:]['-]", " ", str1)
The problem is that it does not remove dashes that follow each other, such as "--", or dashes at the beginning or end of a word: "-word" or "word-".
Answer 0 (score: 6)
I think this does it:
gsub('( |^)-+|-+( |$)', '\\1', gsub("[^ [:alnum:]'-]", '', str1))
#[1] "I'm dash before word word dash in-between word two before word word just dashes between words word word"
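To make the nested call above easier to follow, here is a sketch that splits it into its two gsub passes. The names step1, step2 and step3 are mine, and the final space-collapsing pass is an addition that is not part of the answer; I include it because runs of punctuation that sat between two spaces can leave doubled spaces behind.

```r
str1 <- "I'm dash before word -word, dash &%$,. in-between word, two before word --word just dashes ------, between words word - word"

# Pass 1 (inner gsub): delete everything that is not a space,
# a letter/digit, an apostrophe or a dash.
step1 <- gsub("[^ [:alnum:]'-]", "", str1)

# Pass 2 (outer gsub): delete runs of dashes that touch a space or a
# string edge. "\\1" puts back the leading space (or nothing at the start).
step2 <- gsub("( |^)-+|-+( |$)", "\\1", step1)

# Extra pass (my addition, an assumption about the desired output):
# collapse any doubled spaces left behind.
step3 <- gsub(" +", " ", step2)
print(step3)
```

Dashes inside a word such as in-between survive pass 2 because they have neither a space nor a string edge next to them.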
Answer 1 (score: 4)
Here is one approach:
gsub("([[:alnum:]][[:punct:]][[:alnum:]])|[[:punct:]]", "\\1", str1)
# [1] "I'm dash before word word dash in-between word two before word word just dashes between words word word"
Or, more explicitly:
gsub("([[:alnum:]]['-][[:alnum:]])|[[:punct:]]", "\\1", str1)
The same thing, slightly different/shorter:
gsub("(\\w['-]\\w)|[[:punct:]]", "\\1", str1, perl=TRUE)
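The patterns above all rely on the same idiom: the first branch of the alternation captures what should be kept, the second branch matches what should be dropped, and replacing with "\\1" writes back only the captured part. Below is a small sketch of that idea on a made-up string (demo and keep_or_drop are hypothetical names, not from the answer).

```r
# First branch: an apostrophe or dash with an alphanumeric on each side (kept).
# Second branch: any other single punctuation character (dropped).
keep_or_drop <- "([[:alnum:]]['-][[:alnum:]])|[[:punct:]]"

demo <- "it's -a, b-c --d"  # hypothetical test string
out <- gsub(keep_or_drop, "\\1", demo)
print(out)
# [1] "it's a b-c d"

# Possible caveat: the kept branch consumes the alphanumerics around the mark,
# so a single character between two dashes cannot start a second match.
chained <- gsub(keep_or_drop, "\\1", "a-b-c")
print(chained)
# [1] "a-bc"
```

As far as I can tell the caveat only bites when exactly one character sits between two marks; something like mother-in-law is unaffected because the "n" is still free to start the next match.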
Answer 2 (score: 0)
I suggest
x <- "I'm dash before word -word, dash &%$,. in-between word, two before word --word just dashes ------, between words word - word"
gsub("\\b([-'])\\b|[[:punct:]]+", "\\1", x, perl=TRUE)
# => "I'm dash before word word dash in-between word two before word word just dashes between words word word"
See the R demo. The regex is
\b([-'])\b|[[:punct:]]+
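See the regex demo. Details:
\b([-'])\b - a - or ' that is enclosed by word characters (letters, digits or _) (note: if you only want to keep them between letters, use (?<=\p{L})([-'])(?=\p{L}) instead)
| - or
[[:punct:]]+ - one or more punctuation characters.
Any leading/trailing or doubled whitespace left behind by this replacement can then be removed in a separate cleanup step.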
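The cleanup snippet did not survive in this copy of the answer, so the following is only a sketch of one way to do it, assuming base R's trimws and a second gsub are acceptable. It also shows the letters-only variant mentioned in the details; the sample string "r2-d2 in-between" and the variable names are my own, and \p{L} assumes a PCRE build with Unicode property support.

```r
x <- "I'm dash before word -word, dash &%$,. in-between word, two before word --word just dashes ------, between words word - word"

# Word-boundary pattern from the answer: keep - or ' between word characters,
# drop every other run of punctuation.
res <- gsub("\\b([-'])\\b|[[:punct:]]+", "\\1", x, perl = TRUE)

# Assumed cleanup step: collapse whitespace runs, then trim both ends.
clean <- trimws(gsub("\\s+", " ", res))
print(clean)

# Letters-only variant: the dash in "r2-d2" follows a digit, so it is dropped,
# whereas the word-boundary pattern keeps it.
letters_only <- "(?<=\\p{L})([-'])(?=\\p{L})|[[:punct:]]+"
strict <- gsub(letters_only, "\\1", "r2-d2 in-between", perl = TRUE)
loose  <- gsub("\\b([-'])\\b|[[:punct:]]+", "\\1", "r2-d2 in-between", perl = TRUE)
print(strict)  # [1] "r2d2 in-between"
print(loose)   # [1] "r2-d2 in-between"
```

Which variant fits depends on whether digits and underscores should count as part of a word in your data.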