Question

I am trying to clean up a batch of tweets using gsub.

V3
1  Well: Getting Insurance to Pay for Midwives http://xxxxxxxxx
2  Lightning may be giving you a headache http://xxxxxxxx
3  New York City is requiring flu shots for kids under 5 in city preschools and day care. Do your kids get the flu shot? http://xxxxxxxx
4  VIDEO: Can we erase memories entirely? http://xxxxxxxx
5  Artificial sweeteners are a $1.5-billion-a-year market @kchangnyt reported last year. http://xxxxxxxx

I tried to remove all the links with the following code (taken from an earlier SO question):

newdf1$V3 <- gsub("http\\w+", "", newdf1$V3)

However, the tweets are left unchanged.

By contrast, when I use newdf1$V3 <- gsub("http.*", "", newdf1$V3), the links are removed:

V3
1  Well: Getting Insurance to Pay for Midwives 
2  Lightning may be giving you a headache 
3  New York City is requiring flu shots for kids under 5 in city preschools and day care. Do your kids get the flu shot? 
4  VIDEO: Can we erase memories entirely? 
5  Artificial sweeteners are a $1.5-billion-a-year market @kchangnyt reported last year.

Can someone explain why the code in the first case does not produce the expected result?

Answer 1

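This is because \w matches only word characters (letters, digits, and the underscore). In these tweets, http is always followed by "://", and the colon is not a word character, so \w+ cannot match even a single character after http. The pattern http\w+ therefore matches nothing and gsub returns the text unchanged.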

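By contrast, .* matches everything that follows "http" up to the end of the string, so it works here. Note that it would also remove any text that came after the link.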
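
To make the difference concrete, here is a minimal sketch comparing the two patterns from the question with a third option, http\\S+, which is my suggestion rather than something from the original thread. The vector name tweets and the second sample string (with made-up text after the link) are assumptions added for illustration.

```r
# Sample data modeled on the question. The URLs are placeholders, and the
# trailing "now trending" text is hypothetical, added to show what happens
# when a tweet continues after the link.
tweets <- c(
  "Lightning may be giving you a headache http://xxxxxxxx",
  "VIDEO: Can we erase memories entirely? http://xxxxxxxx now trending"
)

# \w+ needs at least one word character right after "http",
# but the next character is ":", so nothing matches.
no_change <- gsub("http\\w+", "", tweets)

# .* runs to the end of the string, so text after the link is lost too.
greedy <- gsub("http.*", "", tweets)

# \S+ matches non-whitespace only, so the match stops at the end of the link.
url_only <- gsub("http\\S+", "", tweets)

print(no_change)
print(greedy)
print(url_only)
```

If the links can appear in the middle of a tweet, the \\S+ version is probably the safer choice. It leaves the surrounding whitespace in place, so wrapping the result in trimws() may be worthwhile.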
