I am trying to clean up a bunch of tweets using gsub.
V3
1 Well: Getting Insurance to Pay for Midwives http://xxxxxxxxx
2 Lightning may be giving you a headache http://xxxxxxxx
3 New York City is requiring flu shots for kids under 5 in city preschools and day care. Do your kids get the flu shot? http://xxxxxxxx
4 VIDEO: Can we erase memories entirely? http://xxxxxxxx
5 Artificial sweeteners are a $1.5-billion-a-year market @kchangnyt reported last year. http://xxxxxxxx
I tried to remove all the links with the following code (taken from a previous SO question):
newdf1$V3 <- gsub("http\\w+", "", newdf1$V3)
However, the tweets were not changed at all.
By contrast, the following code does remove the links:
newdf1$V3 <- gsub("http.*", "", newdf1$V3)
V3
1 Well: Getting Insurance to Pay for Midwives
2 Lightning may be giving you a headache
3 New York City is requiring flu shots for kids under 5 in city preschools and day care. Do your kids get the flu shot?
4 VIDEO: Can we erase memories entirely?
5 Artificial sweeteners are a $1.5-billion-a-year market @kchangnyt reported last year.
Can someone explain why the code in the first case does not produce the expected result?
Answer 0 (score: 0)
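This is because \w matches only word characters (letters, digits, and the underscore). In these tweets, http is always followed by "://", and \w+ needs at least one word character immediately after http, so the pattern never matches and gsub returns the strings unchanged.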
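A quick way to check this is to test the pattern with grepl. The snippet below is only a sketch on a single sample string copied from the question, and the variable names (x, no_match, unchanged, with_sep) are made up for illustration:

```r
# One sample tweet from the question; x is a hypothetical variable name.
x <- "Lightning may be giving you a headache http://xxxxxxxx"

# "http" is followed by ":", which \w cannot match, so the pattern fails.
no_match <- grepl("http\\w+", x)            # FALSE

# With no match, gsub has nothing to replace and returns x as it was.
unchanged <- gsub("http\\w+", "", x) == x   # TRUE

# Spelling out the "://" separator lets \w+ reach the word characters after it.
with_sep <- grepl("http://\\w+", x)         # TRUE
```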
By contrast, .* matches everything that follows "http" up to the end of the string, so the link is removed.
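One thing to keep in mind is that .* is greedy, so it would also delete any text that comes after a link. If that matters, a possible alternative is to match non-whitespace characters with \S+ instead. This is a sketch, not the approach from the question; the second tweet and the variable names are invented to show the difference:

```r
tweets <- c(
  "Lightning may be giving you a headache http://xxxxxxxx",
  "Read this http://xxxxxxxx before lunch"   # hypothetical tweet with text after the link
)

# "http.*" deletes from "http" to the end of the string,
# leaving a trailing space and dropping " before lunch".
greedy <- gsub("http.*", "", tweets)

# "\\S+" stops at the next whitespace, so only the link itself is removed;
# the leading "\\s*" also drops the space in front of the link.
link_only <- gsub("\\s*http\\S+", "", tweets)
```

Here link_only keeps the words after the link in the second tweet, while greedy cuts them off.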