Remove all punctuation except apostrophes in R

Time: 2012-01-02 03:01:48

Tags: r

I want to use R's gsub to remove all punctuation from a text except apostrophes. I'm fairly new to regular expressions, but I'm learning.

Example:

x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
gsub("[[:punct:]]", "", as.character(x))

Current output (the apostrophe is gone)

[1] "I like to chew gum but dont like bubble gum"

Desired output (I want the apostrophe to stay)

[1] "I like to chew gum but don't like bubble gum"

4 answers:

Answer 0 (score: 36)

x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
gsub("[^[:alnum:][:space:]']", "", x)

[1] "I like to chew gum but don't like bubble gum"

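The regex above is more straightforward. It replaces everything that is not an alphanumeric character, whitespace, or an apostrophe with an empty string. Note the leading caret: it is what negates the character class.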
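To make the negated class easier to see, here is a small sketch that tests it against single characters and then wraps it in a helper. The name keep_apostrophes is hypothetical and not part of the original answer; the pattern itself is the one shown above.

```r
# Hypothetical helper (name is an assumption, not from the answer):
# drops every character that is not alphanumeric, whitespace, or an apostrophe.
keep_apostrophes <- function(text) {
  gsub("[^[:alnum:][:space:]']", "", text)
}

# Which single characters does the negated class match (and therefore remove)?
chars <- c("a", "7", " ", "'", "!", "^")
removed <- grepl("[^[:alnum:][:space:]']", chars)
print(setNames(removed, chars))

x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
result <- keep_apostrophes(x)
print(result)
```

Letters, digits, the space, and the apostrophe are left alone; only the remaining symbols are matched and replaced.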

Answer 1 (score: 7)

Here is one example:

>  gsub("(.*?)($|'|[^[:punct:]]+?)(.*?)", "\\2", x)
[1] "I like to chew gum but don't like bubble gum"

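Answer 2 (score: 5)

Mostly for variety, here is a solution using gsubfn() from the terrific package of the same name. In this application, I just like how nicely expressive the solution is: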

library(gsubfn)
gsubfn(pattern = "[[:punct:]]", engine = "R",
       replacement = function(x) ifelse(x == "'", "'", ""), 
       x)
[1] "I like to chew gum but don't like bubble gum"

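(The argument engine = "R" is needed here, because otherwise the default tcl engine is used. Its rules for matching regular expressions are slightly different: for example, to process the string above with it, you would instead need to set pattern = "[[:punct:]$|^]". Thanks to G. Grothendieck for pointing out this detail.)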
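If installing gsubfn is not an option, the same "decide per match" idea can be sketched in base R with gregexpr() and the regmatches<- replacement function. This is an alternative I am swapping in for illustration, not what the answer above uses, and it relies on R's default regex engine rather than tcl.

```r
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"

# Locate every punctuation character in x.
m <- gregexpr("[[:punct:]]", x)

# Replace each match individually: keep apostrophes, drop everything else.
# The replacement list must hold one vector per input string, with one
# element per match.
result <- x
regmatches(result, m) <- lapply(
  regmatches(result, m),
  function(p) ifelse(p == "'", "'", "")
)
print(result)
```

The callback here plays the same role as the replacement function passed to gsubfn().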

Answer 3 (score: 4)

You can exclude the apostrophe from the POSIX class punct using a double negation:

[^'[:^punct:]]

Code:

x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
gsub("[^'[:^punct:]]", "", x, perl = TRUE)

#[1] "I like to chew gum but don't like bubble gum"
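A quick way to convince yourself of the double negation is to test the class one character at a time. The sketch below assumes perl = TRUE (the [:^punct:] form is PCRE syntax) and compares the result with the negated alnum/space class from the top answer on the same sample string.

```r
# [^'[:^punct:]] reads as: not an apostrophe AND not a non-punctuation
# character, i.e. any punctuation character other than the apostrophe.
chars <- c("a", " ", "'", "!", "|", "$")
dropped <- grepl("[^'[:^punct:]]", chars, perl = TRUE)
print(setNames(dropped, chars))

x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
via_double_negation <- gsub("[^'[:^punct:]]", "", x, perl = TRUE)
via_negated_class   <- gsub("[^[:alnum:][:space:]']", "", x)

# On this sample string the two approaches give the same result.
print(identical(via_double_negation, via_negated_class))
```

The two patterns are not equivalent in general: the double negation removes only punctuation, while the negated class removes anything that is not a letter, digit, whitespace, or apostrophe.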
