Pattern replacement in R

Date: 2014-06-09 15:01:59

Tags: regex r twitter

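I am working on a Twitter dataset in R, and I am finding it hard to remove the usernames from the tweets.

Here is a sample of the tweets in the tweets column of my dataset: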

[1] "@danimottale: 2 bad our inalienable rights offend their sensitivities. U cannot reason with obtuse zealotry. // So very well said."         
[2] "@FreeMktMonkey @drleegross Want to build HSA throughout lifetime for when older thus need HDHP not to deplete it if ill before 65y/o.thanks"

I want to remove/replace every word that starts with "@" to get this output:

[1] "2 bad our inalienable rights offend their sensitivities. U cannot reason with obtuse zealotry. // So very well said."         
[2] "Want to build HSA throughout lifetime for when older thus need HDHP not to deplete it if ill before 65y/o.thanks"

This gsub call only removes the "@" symbol:

gsub("@", "", tweetdata$tweets)

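What I want to express is: delete the characters that follow the "@" symbol until a space or punctuation mark is encountered.

I started by trying to handle the whitespace, but to no avail: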

gsub("@.*[:space:]$", "", tweetdata$tweets)

That removes the second tweet entirely.

gsub("@.*[:blank:]$", "", tweetdata$tweets)

That one does not change the output at all.

I would really appreciate your help.
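
A note on why the two attempts above behave this way, as a minimal sketch rather than a definitive diagnosis. In R's default regex engine a POSIX class name is only recognised inside a bracket expression, so it has to be written [[:space:]]. A bare [:space:] is read as the set of literal characters ":", "s", "p", "a", "c", "e", and the trailing "$" pins the match to the end of the string. The vector x below is an illustrative assumption: shortened stand-ins for the two tweets that keep their final characters.

```r
# Hypothetical stand-ins for the two tweets (same last characters: "." and "s").
x <- c("@a: ends with a period.", "@b @c ends with thanks")

# Bare [:space:] is the character set {:, s, p, a, c, e}; "$" anchors it to the end.
# The second string ends in "s", so "@.*" swallows everything from the first "@".
attempt1 <- gsub("@.*[:space:]$", "", x)

# Bare [:blank:] is the set {:, b, l, a, n, k}; neither string ends in one of those.
attempt2 <- gsub("@.*[:blank:]$", "", x)

# Doubled brackets, no "$": remove "@", the run of non-whitespace, then one whitespace.
fixed <- gsub("@[^[:space:]]+[[:space:]]", "", x)

print(attempt1)
print(attempt2)
print(fixed)
```

With these stand-ins, the first call empties only the second string, the second call leaves both strings unchanged, and the last call strips the mentions from both.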

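2 Answers:

Answer 0: (score: 9)

You can use the following. \S+ matches any non-whitespace character (one or more times), and \s then matches a single whitespace character.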

gsub('@\\S+\\s', '', noRT$text)
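
To try the pattern without the original data frame, here is a minimal self-contained sketch. It assumes the two sample tweets from the question are stored in a plain character vector; the name tweets is hypothetical and stands in for the data frame column.

```r
# The two sample tweets from the question, in a hypothetical character vector.
tweets <- c(
  "@danimottale: 2 bad our inalienable rights offend their sensitivities. U cannot reason with obtuse zealotry. // So very well said.",
  "@FreeMktMonkey @drleegross Want to build HSA throughout lifetime for when older thus need HDHP not to deplete it if ill before 65y/o.thanks"
)

# Remove "@", the run of non-whitespace after it, and one trailing whitespace character.
cleaned <- gsub("@\\S+\\s", "", tweets)
print(cleaned)
```

Because the pattern requires a whitespace character after the mention, a mention at the very end of a tweet would be left in place. Making that part optional, as in "@\\S+\\s?", is one way to cover that case.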

Edit: a negated character class works just as well (using only the space character):

gsub('@[^ ]+ ', '', noRT$text)
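
The two patterns give the same result on the sample tweets, but they differ when a mention is followed by whitespace other than a plain space. The sketch below uses a made-up input string (an assumption, not taken from the dataset) with a tab after the mention to show the difference.

```r
# Hypothetical input: the mention is followed by a tab, not a space.
x <- "@user\thello world"

# \s matches the tab, so only the mention and the tab are removed.
with_class <- gsub("@\\S+\\s", "", x)

# [^ ] also matches the tab, so the match runs on until the next literal space.
with_space <- gsub("@[^ ]+ ", "", x)

print(with_class)  # "hello world"
print(with_space)  # "world"
```

If tweets may contain tabs or newlines, the \S and \s version is likely the safer choice.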

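Answer 1: (score: 1)

The regex approach here is simple and direct. I am adding a second option that uses qdap's genX function to remove the text between any two boundaries. It lets you supply a left and a right boundary.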

library(qdap)
genX(x, "@", "\\s")

## [1] "2 bad our inalienable rights offend their sensitivities. U cannot reason with obtuse zealotry. // So very well said."
## [2] "Want to build HSA throughout lifetime for when older thus need HDHP not to deplete it if ill before 65y/o.thanks"
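
For readers without qdap installed, the same left/right-boundary idea can be approximated in base R. This is only a sketch and not qdap's implementation: remove_between is a hypothetical helper, it treats both boundaries as regular expressions, and it relies on a lazy quantifier with perl = TRUE. The input strings are shortened versions of the sample tweets.

```r
# Hypothetical helper: delete everything from `left` through the nearest `right`.
# Both boundaries are interpreted as regular expressions.
remove_between <- function(x, left, right) {
  pattern <- paste0(left, ".*?", right)  # ".*?" is lazy: stop at the first `right`
  gsub(pattern, "", x, perl = TRUE)
}

# Shortened versions of the sample tweets.
x <- c("@danimottale: 2 bad our rights.",
       "@FreeMktMonkey @drleegross Want to build HSA")

result <- remove_between(x, "@", "\\s")
print(result)
```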