I want to use R's gsub to remove all punctuation from a text except apostrophes. I'm fairly new to regular expressions, but I'm learning.
Example:
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
gsub("[[:punct:]]", "", as.character(x))
Current output (the apostrophe is gone):
[1] "I like to chew gum but dont like bubble gum"
Desired output (I want the apostrophe to stay):
[1] "I like to chew gum but don't like bubble gum"
Answer 0 (score: 36)
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
gsub("[^[:alnum:][:space:]']", "", x)
[1] "I like to chew gum but don't like bubble gum"
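The regular expression above is more straightforward. It replaces everything that is not an alphanumeric character, whitespace, or an apostrophe (note the caret, which negates the character class!) with an empty string.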
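As a minimal sketch of how the negated class can be reused, the pattern from this answer can be wrapped in a small helper. The function name `strip_punct` and its `keep` argument are hypothetical, introduced here only for illustration. `keep` is pasted verbatim into the bracket expression, so characters that are special inside brackets (such as `]` or `^`) would need extra care.

```r
# Hypothetical helper: remove everything except letters, digits,
# whitespace, and the characters listed in `keep`.
strip_punct <- function(text, keep = "'") {
  # Build a negated bracket expression such as [^[:alnum:][:space:]']
  pattern <- paste0("[^[:alnum:][:space:]", keep, "]")
  gsub(pattern, "", text)
}

y <- c("Don't stop!", "rock 'n' roll?", "a-b_c")
strip_punct(y)
# [1] "Don't stop"    "rock 'n' roll" "abc"

# Keep hyphens as well; a hyphen placed last in the class is literal.
strip_punct("well-known, isn't it?", keep = "'-")
# [1] "well-known isn't it"
```

Because gsub is vectorized, the helper works on a whole character vector at once.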
Answer 1 (score: 7)
Here is an example:
> gsub("(.*?)($|'|[^[:punct:]]+?)(.*?)", "\\2", x)
[1] "I like to chew gum but don't like bubble gum"
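Answer 2 (score: 5)
Offered mostly as an alternative, here is a solution using gsubfn() from the excellent package of the same name. In this application, I simply like how expressively it lets the solution be written: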
library(gsubfn)
gsubfn(pattern = "[[:punct:]]", engine = "R",
       replacement = function(x) ifelse(x == "'", "'", ""),
       x)
[1] "I like to chew gum but don't like bubble gum"
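(The argument engine = "R" is needed here; otherwise the default tcl engine is used. Its rules for matching regular expressions are slightly different: for example, if it were used to process the string above, you would instead need to set pattern = "[[:punct:]$|^]". Thanks to G. Grothendieck for pointing out this detail.)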
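For readers who do not have gsubfn installed, the same idea of deciding per match what to substitute can be sketched in base R with gregexpr() and the replacement form of regmatches(). This is not the answer's method, only a rough equivalent; the helper name `keep_apostrophe` and the sample string are assumptions made for illustration.

```r
x2 <- "it's a test-case, isn't it?"

# Locate every punctuation character in the string.
m <- gregexpr("[[:punct:]]", x2)

# Hypothetical helper: blank out every match that is not an apostrophe.
keep_apostrophe <- function(p) {
  p[p != "'"] <- ""
  p
}

# Replace each match with the value chosen by the helper.
regmatches(x2, m) <- lapply(regmatches(x2, m), keep_apostrophe)
x2
# [1] "it's a testcase isn't it"
```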
Answer 3 (score: 4)
You can use a double negation to exclude the apostrophe from the POSIX class punct:
[^'[:^punct:]]
Code:
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
gsub("[^'[:^punct:]]", "", x, perl = TRUE)
#[1] "I like to chew gum but don't like bubble gum"
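To see what the double negation actually matches, one can list the characters it would remove. The class `[^'[:^punct:]]` excludes the apostrophe and everything that is not punctuation, so only punctuation other than the apostrophe remains. The snippet below is a small sketch on a made-up sample string; it relies on perl = TRUE, since the `[:^punct:]` form is PCRE syntax.

```r
s <- "it's a test-case, isn't it?"

# Extract every character matched by the double-negated class.
removed <- regmatches(s, gregexpr("[^'[:^punct:]]", s, perl = TRUE))[[1]]
removed
# [1] "-" "," "?"

# The apostrophes survive the substitution.
cleaned <- gsub("[^'[:^punct:]]", "", s, perl = TRUE)
cleaned
# [1] "it's a testcase isn't it"
```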