I have a vector of strings, myStrings, which looks like this in R:
[1] download file from `http://example.com`
[2] this is the link to my website `another url`
[3] go to `another url` from more info.
Here, another url stands for a valid http URL; Stack Overflow won't let me insert more than one URL, which is why I wrote another url instead. I want to remove all URLs from myStrings:
[1] download file from
[2] this is the link to my website
[3] go to from more info.
I tried many functions from the stringr package, but nothing worked.
Answer 0: (score: 12)
You can use gsub with a regular expression that matches the URLs.
Set up the vector:
x <- c(
"download file from http://example.com",
"this is the link to my website http://example.com",
"go to http://example.com from more info.",
"Another url ftp://www.example.com",
"And https://www.example.net"
)
Remove all the URLs from each string:
gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
# [1] "download file from" "this is the link to my website"
# [3] "go to from more info." "Another url"
# [5] "And"
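Update: It would help if you posted a few of the different URLs so we know what we are working with. That said, I think this regular expression will work for the URLs you mentioned in the comments: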
" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)"
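The expression above, piece by piece:
" ?" an optional space
"(f|ht)" matches "f" or "ht"
"tp" matches "tp"
"(s?)" optionally matches "s"
"(://)" matches "://"
"(.*)" matches every character (everything) up to
"[.|/]" a period or a forward slash (inside a character class the | is literal, so a pipe character matches here as well)
"(.*)" then everything after that
I'm not an expert with regular expressions, but I think I've explained that correctly.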
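As a quick check of the breakdown above, here is a minimal sketch that assembles the same pattern from its pieces and applies it. The vector y and its URLs are made up for illustration and do not come from the question. Note that, because of the trailing (.*), everything after the URL is removed as well.

```r
# Assemble the pattern from the pieces listed above
pieces <- c(" ?", "(f|ht)", "(tp)", "(s?)", "(://)", "(.*)", "[.|/]", "(.*)")
pattern <- paste0(pieces, collapse = "")

# Hypothetical test strings (not from the question)
y <- c(
  "download file from http://example.com/path/page.html",
  "go to http://example.com from more info."
)

# The trailing (.*) consumes the rest of the string after the URL
cleaned <- gsub(pattern, "", y)
cleaned
# first element:  "download file from"
# second element: "go to"
```

If the text after the URL should be kept, a pattern that stops at whitespace (for example one ending in \\S+ instead of (.*)) is likely a better fit.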
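Note: URL shorteners are no longer allowed in SO answers, so I was forced to remove a section while making my most recent edit. See the edit history for that section.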
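Answer 1: (score: 8)
I have been working on a set of canned regular expressions for common tasks like this one, and I have put them into a package, qdapRegex, on GitHub, which will eventually go to CRAN. It can extract the matched pieces as well as remove them. Any feedback on the package is welcome.
Here it is:
library(devtools)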
install_github("trinker/qdapRegex")
library(qdapRegex)
x <- c("download file from http://example.com",
"this is the link to my website http://example.com",
"go to http://example.com from more info.",
"Another url ftp://www.example.com",
"And https://www.example.net",
"twitter type: t.co/N1kq0F26tG",
"still another one https://t.co/N1kq0F26tG :-)")
rm_url(x, pattern=pastex("@rm_twitter_url", "@rm_url"))
## [1] "download file from" "this is the link to my website"
## [3] "go to from more info." "Another url"
## [5] "And" "twitter type:"
## [7] "still another one :-)"
rm_url(x, pattern=pastex("@rm_twitter_url", "@rm_url"), extract=TRUE)
## [[1]]
## [1] "http://example.com"
##
## [[2]]
## [1] "http://example.com"
##
## [[3]]
## [1] "http://example.com"
##
## [[4]]
## [1] "ftp://www.example.com"
##
## [[5]]
## [1] "https://www.example.net"
##
## [[6]]
## [1] "t.co/N1kq0F26tG"
##
## [[7]]
## [1] "https://t.co/N1kq0F26tG"
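Edit: I saw that Twitter links were not being removed. I will not add this to the regex specific to the rm_url function, but I have added it to the dictionary in qdapRegex. So there is no dedicated function that removes both standard URLs and Twitter URLs, but pastex (paste regular expressions) lets you easily grab regexes from the dictionary and paste them together (using the pipe operator, |). Since all rm_XXX style functions work essentially the same way, you can pass the pastex output to the pattern argument of any rm_XXX function, or create your own function as shown below: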
rm_twitter_url <- rm_(pattern=pastex("@rm_twitter_url", "@rm_url"))
rm_twitter_url(x)
rm_twitter_url(x, extract=TRUE)
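For readers who would rather not install a package, the same idea (joining several regexes with | and reusing the result) can be sketched in base R. This is not how qdapRegex implements pastex or rm_; the two patterns and the function name remove_urls below are assumptions chosen purely for illustration.

```r
# Two illustrative patterns (assumptions, not the qdapRegex dictionary entries)
url_pattern   <- "(f|ht)tps?://\\S+"
short_pattern <- "[A-Za-z]{1,5}[.][A-Za-z]{2,3}/[A-Za-z0-9]+"

# Join the alternatives with the pipe operator and allow one leading space
combined <- paste0(" ?(", paste(c(url_pattern, short_pattern), collapse = "|"), ")")

# Hypothetical helper: remove every match, then trim the ends of each string
remove_urls <- function(x, pattern = combined) {
  trimws(gsub(pattern, "", x))
}

x2 <- c(
  "go to http://example.com from more info.",
  "twitter type: t.co/N1kq0F26tG",
  "still another one https://t.co/N1kq0F26tG :-)"
)
res <- remove_urls(x2)
res
# "go to from more info."  "twitter type:"  "still another one :-)"
```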
Answer 2: (score: 3)
str1 <- c("download file from http://example.com", "this is the link to my website https://www.google.com/ for more info")
gsub('http\\S+\\s*',"", str1)
#[1] "download file from "
#[2] "this is the link to my website for more info"
library(stringr)
str_trim(gsub('http\\S+\\s*',"", str1)) #removes trailing/leading spaces
#[1] "download file from"
#[2] "this is the link to my website for more info"
To match ftp as well, I would use the same idea from @Richard Scriven's post:
str1 <- c("download file from http://example.com", "this is the link to my website https://www.google.com/ for more info",
"this link to ftp://www.example.org/community/mail/view.php?f=db/6463 gives more info")
gsub('(f|ht)tp\\S+\\s*',"", str1)
#[1] "download file from "
#[2] "this is the link to my website for more info"
#[3] "this link to gives more info"
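Since the question asked about stringr, here is a hedged sketch of the same removal written with stringr functions only. It assumes a stringr version that provides str_remove_all() and str_squish() (1.3.0 or later, as far as I know); the pattern is the one used above, narrowed slightly to require "://".

```r
library(stringr)

str1 <- c(
  "download file from http://example.com",
  "this is the link to my website https://www.google.com/ for more info",
  "this link to ftp://www.example.org/community/mail/view.php?f=db/6463 gives more info"
)

# Remove every http/https/ftp URL, then collapse repeated spaces
# and trim leading/trailing whitespace in one step
no_urls <- str_squish(str_remove_all(str1, "(f|ht)tps?://\\S+"))
no_urls
# "download file from"
# "this is the link to my website for more info"
# "this link to gives more info"
```

str_squish() takes care of the double space that is otherwise left behind when a URL is removed from the middle of a sentence.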
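Answer 3: (score: 2)
Some of the previous answers remove text beyond the end of the URL; adding "\b" helps with that. The pattern below also covers "sftp://" URLs.
For regular URLs: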
gsub("(s?)(f|ht)tp(s?)://\\S+\\b", "", x)
For tiny (shortened) URLs:
gsub("[A-Za-z]{1,5}[.][A-Za-z]{2,3}/[A-Za-z0-9]+\\b", "", x)
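To illustrate what the trailing \\b changes, here is a small sketch comparing the pattern with and without it. The example strings are made up; the point is that punctuation directly after a URL is kept when the match has to end at a word boundary.

```r
# Hypothetical strings with punctuation right after the URL
z <- c(
  "see http://example.com.",
  "upload via sftp://files.example.org/data, then stop"
)

# Without \\b: \\S+ also swallows the trailing "." and ","
without_b <- gsub("(s?)(f|ht)tp(s?)://\\S+", "", z)
# "see "  and  "upload via  then stop"

# With \\b: the match must end at a word boundary, so punctuation stays
with_b <- gsub("(s?)(f|ht)tp(s?)://\\S+\\b", "", z)
# "see ."  and  "upload via , then stop"
```

Either way a stray space remains where the URL was, so a final whitespace cleanup step may still be needed.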