Remove all dashes outside of words in R

Time: 2014-07-15 15:28:53

Tags: regex r gsub

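I asked a similar question before, but this one is more specific and needs a different solution from the ones offered there, so I hope it is OK to post it. I need to keep only apostrophes and intra-word dashes in my text and remove all other punctuation. For example, I want to get str2 from str1: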

str1<-"I'm dash before word -word, dash &%$,. in-between word, two before word --word just dashes ------, between words word - word"
str2<-"I'm dash before word word dash in-between word two before word  word just dashes  between words word  word"

My solution so far first removes the dashes between words:
    gsub(" - ", " ", str1)

and then keeps the alphanumeric characters and the remaining dashes:
    gsub("[^[:alnum:]['-]", " ", str1)

The problem is that it does not remove dashes that follow one another, such as "--", or dashes at the beginning or end of a word: "-word" or "word-".

3 Answers:

Answer 0: (score: 6)

I think this does it:

gsub('( |^)-+|-+( |$)', '\\1', gsub("[^ [:alnum:]'-]", '', str1))
#[1] "I'm dash before word word dash  in-between word two before word word just dashes  between words word  word"
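
To make the nested call above easier to follow, here is a minimal sketch that runs the two substitutions one at a time. The variable names (step1, step2, lossy, kept) are illustrative only and are not part of the original answer.

```r
str1 <- "I'm dash before word -word, dash &%$,. in-between word, two before word --word just dashes ------, between words word - word"

# Step 1: delete everything that is not a space, an alphanumeric character,
# an apostrophe or a dash.
step1 <- gsub("[^ [:alnum:]'-]", "", str1)

# Step 2: delete runs of dashes that follow a space (or the start of the
# string) or precede a space (or the end). Only group 1 is put back.
step2 <- gsub("( |^)-+|-+( |$)", "\\1", step1)

print(step1)
print(step2)

# Assumption-labelled variation: because only \\1 is restored, a trailing
# dash followed by a space loses that space. Restoring both groups keeps it.
lossy <- gsub("( |^)-+|-+( |$)", "\\1", "word- next")
kept  <- gsub("( |^)-+|-+( |$)", "\\1\\2", "word- next")
print(c(lossy, kept))
```

The sample string from the question has no "word- next" case, so the original replacement gives the expected result there; the "\\1\\2" replacement is only a suggestion for inputs where a dash trails a word.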

Answer 1: (score: 4)

Here is one way:

gsub("([[:alnum:]][[:punct:]][[:alnum:]])|[[:punct:]]", "\\1", str1)
# [1] "I'm dash before word word dash  in-between word two before word word just dashes  between words word  word"

Or, more explicitly:

gsub("([[:alnum:]]['-][[:alnum:]])|[[:punct:]]", "\\1", str1)

Same thing, slightly different/shorter:

gsub("(\\w['-]\\w)|[[:punct:]]", "\\1", str1, perl=TRUE)
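
All three variants rely on the same "capture what to keep, match what to drop" alternation. The sketch below wraps the explicit variant in a helper to show this on a few short inputs; the function name keep_inword_marks is hypothetical and the test strings are made up for illustration.

```r
# Hypothetical helper name; not part of the original answer.
keep_inword_marks <- function(x) {
  # Alternative 1 captures an apostrophe or dash flanked by alphanumerics and
  # is written back through \\1. Alternative 2 matches any other punctuation
  # character; \\1 is empty for it, so the character is deleted.
  gsub("([[:alnum:]]['-][[:alnum:]])|[[:punct:]]", "\\1", x)
}

examples <- c("don't stop", "-lead trail-", "well-known!!", "a - b")
cleaned <- keep_inword_marks(examples)
print(cleaned)

# Caveat: alternative 1 consumes the flanking characters, so two marks
# separated by a single character cannot both be kept.
overlap <- keep_inword_marks("rock-n-roll")
print(overlap)
```

Because the kept match includes the neighbouring characters, an input like "rock-n-roll" loses its second dash; a pattern built on zero-width assertions (word boundaries or lookarounds) does not have this limitation.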

Answer 2: (score: 0)

I suggest

x <- "I'm dash before word -word, dash &%$,. in-between word, two before word --word just dashes ------, between words word - word"
gsub("\\b([-'])\\b|[[:punct:]]+", "\\1", x, perl=TRUE)
# =>  "I'm dash before word word dash  in-between word two before word word just dashes  between words word  word"

The regex is

\b([-'])\b|[[:punct:]]+

Details:

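  • \b([-'])\b - a - or ' that is enclosed with word characters (letters, digits or _) (note: if you only want to keep them between letters, use (?<=\p{L})([-'])(?=\p{L}) instead)
  • | - or
  • [[:punct:]]+ - 1 or more punctuation characters.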
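
To illustrate the note about the letters-only variant, here is a small sketch comparing the two patterns on a made-up string. The input and the variable names (word_chars, letters_only) are assumptions chosen for illustration.

```r
x <- "re-run 3-4 o'clock -x- it's"

# Word-boundary version: keeps - or ' between any word characters,
# which includes digits and underscores.
word_chars <- gsub("\\b([-'])\\b|[[:punct:]]+", "\\1", x, perl = TRUE)

# Lookaround version: keeps - or ' only when a letter sits on both sides.
letters_only <- gsub("(?<=\\p{L})([-'])(?=\\p{L})|[[:punct:]]+", "\\1", x, perl = TRUE)

print(word_chars)
print(letters_only)
```

The only difference on this input is the dash in "3-4": it survives the word-boundary pattern and is dropped by the lookaround pattern. Both patterns need perl=TRUE.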

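To remove any leading/trailing and doubled whitespace characters left after this replacement, trim the result and collapse runs of whitespace into a single space.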
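
One way this whitespace cleanup could look is sketched below. It is not necessarily the code the answer originally showed: the use of trimws() (base R 3.2.0 or later) and the \\s{2,} pattern are assumptions, as are the variable names res and clean.

```r
x <- "I'm dash before word -word, dash &%$,. in-between word, two before word --word just dashes ------, between words word - word"

# Punctuation removal from the answer above.
res <- gsub("\\b([-'])\\b|[[:punct:]]+", "\\1", x, perl = TRUE)

# Collapse runs of two or more whitespace characters into one space,
# then strip leading and trailing whitespace.
clean <- trimws(gsub("\\s{2,}", " ", res))

print(clean)
```

Collapsing before trimming means a run of spaces at either end is first reduced to one space and then removed by trimws().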