Cleaning tweets with gsub in R

Asked: 2018-11-19 18:27:17

Tags: r

I am trying to clean up a bunch of tweets using gsub.

V3
1  Well: Getting Insurance to Pay for Midwives http://xxxxxxxxx
2  Lightning may be giving you a headache http://xxxxxxxx
3  New York City is requiring flu shots for kids under 5 in city preschools and day care. Do your kids get the flu shot? http://xxxxxxxx
4  VIDEO: Can we erase memories entirely? http://xxxxxxxx
5  Artificial sweeteners are a $1.5-billion-a-year market @kchangnyt reported last year. http://xxxxxxxx

I tried to remove all the links with the following code (taken from a previous SO question):

newdf1$V3 <- gsub("http\\w+", "", newdf1$V3)

However, the tweets are left unchanged.

By contrast, when I use newdf1$V3 <- gsub("http.*", "", newdf1$V3), the links are removed:

V3
1  Well: Getting Insurance to Pay for Midwives 
2  Lightning may be giving you a headache 
3  New York City is requiring flu shots for kids under 5 in city preschools and day care. Do your kids get the flu shot? 
4  VIDEO: Can we erase memories entirely? 
5  Artificial sweeteners are a $1.5-billion-a-year market @kchangnyt reported last year. 

Can someone explain why the code in the first case does not produce the expected result?

1 Answer:

Answer 0: (score: 0)

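This is because \w matches only word characters (letters, digits, and the underscore). In these tweets http is always followed by "://", so \w+ has no word character to match right after http. The pattern therefore never matches, and gsub returns each string unchanged.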
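A minimal sketch of that behaviour, using one of the tweets from the question with its placeholder URL. The variable names (tweet, has_match, unchanged, secure, leftover) and the "https" string are illustrative assumptions, not part of the original data.

```r
# One tweet from the question, with the placeholder URL as posted
tweet <- "Lightning may be giving you a headache http://xxxxxxxx"

# "http\\w+" needs at least one word character right after "http",
# but the next character here is ":", so there is no match
has_match <- grepl("http\\w+", tweet)

# With no match, gsub returns the input unchanged
unchanged <- gsub("http\\w+", "", tweet)

# Hypothetical "https" link: "s" is a word character, so only "https"
# is removed and the rest of the URL is left behind
secure <- "Lightning may be giving you a headache https://xxxxxxxx"
leftover <- gsub("http\\w+", "", secure)

print(has_match)
print(identical(unchanged, tweet))
print(leftover)
```

So even where the pattern does match, as in the "https" case, it stops at the colon and does not remove the whole link.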

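In contrast, .* matches everything that follows "http" up to the end of the string, so it works here. Note that it would also remove any text that comes after the link.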
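If a link can appear in the middle of a tweet, one possible alternative is to match "http" followed by non-whitespace characters only. The sketch below is an assumption about what you might want, not something from the original question; the second tweet and the variable names (tweets, greedy, url_only) are made up for illustration.

```r
tweets <- c(
  "Lightning may be giving you a headache http://xxxxxxxx",
  "Read this http://xxxxxxxx before lunch"  # hypothetical: text after the link
)

# "http.*" removes everything from "http" to the end of the string
greedy <- gsub("http.*", "", tweets)

# "http\\S+" removes "http" plus the run of non-whitespace characters after it
url_only <- gsub("http\\S+", "", tweets)

# Optional tidy-up: collapse repeated whitespace and trim the ends
url_only <- trimws(gsub("\\s+", " ", url_only))

print(greedy)
print(url_only)
```

With "http\\S+" the words after the link in the second tweet are kept, whereas "http.*" drops them.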