Remove URLs from a string

Asked: 2014-08-17 18:40:56

Tags: r string stringr

I have a vector of strings, myStrings, in R that looks like this:

[1] download file from `http://example.com`
[2] this is the link to my website `another url`
[3] go to `another url` from more info.

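Here, another url stands for a valid http URL; Stack Overflow will not let me insert more than one URL, which is why I am writing another url instead. I want to remove all the URLs from myStrings so that it looks like this: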

[1] download file from
[2] this is the link to my website
[3] go to from more info.

I have tried many functions in the stringr package, but nothing works.

4 answers:

Answer 0 (score: 12)

You can use gsub with a regular expression that matches the URLs.

Set up the vector:

x <- c(
    "download file from http://example.com", 
    "this is the link to my website http://example.com", 
    "go to http://example.com from more info.",
    "Another url ftp://www.example.com",
    "And https://www.example.net"
)

Remove all URLs from each string:

gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
# [1] "download file from"             "this is the link to my website"
# [3] "go to from more info."          "Another url"                   
# [5] "And"   

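Update: It would be best if you posted a few different URLs so we know what we are working with. But I think this regular expression will work for the URLs you mentioned in the comments: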

" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)"

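The expression above, explained:

  • " ?" an optional space
  • (f|ht) matches "f" or "ht"
  • tp matches "tp"
  • (s?) optionally matches "s"
  • (://) matches "://"
  • (.*) matches every character (everything) up to
  • [.|/] a period, a pipe, or a forward slash (inside brackets, | is a literal character rather than alternation)
  • (.*) then everything after that

I am no expert with regular expressions, but I think I have explained that correctly.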
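A quick check of how that updated pattern behaves may help here. This is only a sketch on two of the sample strings; the variable names `pat`, `res`, and `pipe_matches` are made up for illustration.

```r
# Pattern from the update above, unchanged
pat <- " ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)"

x <- c(
    "download file from http://example.com",
    "go to http://example.com from more info."
)

# The final (.*) is greedy, so each match runs to the end of the string
res <- gsub(pat, "", x)
print(res)

# Inside [...] the | is an ordinary character, not alternation
pipe_matches <- grepl("[.|/]", "a|b")
print(pipe_matches)
```

Because of the trailing `(.*)`, any text that follows the URL is removed as well: the second string comes back as "go to", not "go to from more info.".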

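Note: URL shorteners are no longer permitted in SO answers, so I was forced to remove a section when making my most recent edit. See the edit history for that section.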

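Answer 1 (score: 8)

I have been working on a canned set of regular expressions for common tasks like this one, and I have put them into a package, qdapRegex, on GitHub, which will eventually go to CRAN. It can extract the matched pieces as well as substitute them out. Any feedback on the package is welcome.

Here it is: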

library(devtools)
install_github("trinker/qdapRegex")
library(qdapRegex)

x <- c("download file from http://example.com", 
         "this is the link to my website http://example.com", 
         "go to http://example.com from more info.",
         "Another url ftp://www.example.com",
         "And https://www.example.net",
         "twitter type: t.co/N1kq0F26tG",
         "still another one https://t.co/N1kq0F26tG :-)")

rm_url(x, pattern=pastex("@rm_twitter_url", "@rm_url"))

## [1] "download file from"             "this is the link to my website"
## [3] "go to from more info."          "Another url"                   
## [5] "And"                            "twitter type:"                 
## [7] "still another one :-)"         

rm_url(x, pattern=pastex("@rm_twitter_url", "@rm_url"), extract=TRUE)

## [[1]]
## [1] "http://example.com"
## 
## [[2]]
## [1] "http://example.com"
## 
## [[3]]
## [1] "http://example.com"
## 
## [[4]]
## [1] "ftp://www.example.com"
## 
## [[5]]
## [1] "https://www.example.net"
## 
## [[6]]
## [1] "t.co/N1kq0F26tG"
## 
## [[7]]
## [1] "https://t.co/N1kq0F26tG"

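EDIT: I saw that Twitter links were not being removed. I will not add this to the regular expression specific to the rm_url function, but I have added it to the dictionary in qdapRegex. So there is no dedicated function that removes both standard URLs and Twitter URLs, but pastex (paste regular expressions) lets you easily grab regular expressions from the dictionary and paste them together (using the pipe operator, |). Since all rm_XXX style functions work in essentially the same way, you can pass the pastex output to the pattern argument of any rm_XXX function, or create your own function as shown below: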

rm_twitter_url <- rm_(pattern=pastex("@rm_twitter_url", "@rm_url"))
rm_twitter_url(x)
rm_twitter_url(x, extract=TRUE)
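For readers without qdapRegex installed, the idea of pasting patterns together with | can be approximated in base R. This is a rough sketch only: `url_pat` and `twitter_pat` are hypothetical stand-ins, not the expressions stored in the qdapRegex dictionary, and `perl = TRUE` is an assumption made for predictable `\b` and `\w` handling.

```r
# Hypothetical stand-in patterns (assumptions, not qdapRegex's dictionary entries)
url_pat     <- "(f|ht)tps?://\\S+"
twitter_pat <- "\\bt\\.co/\\w+"

# Join the alternatives with the regex "or" operator
combined <- paste(c(url_pat, twitter_pat), collapse = "|")

x <- c(
    "And https://www.example.net",
    "twitter type: t.co/N1kq0F26tG",
    "still another one https://t.co/N1kq0F26tG :-)"
)

# Remove the matches, collapse leftover runs of whitespace, trim the ends
removed <- gsub(combined, "", x, perl = TRUE)
cleaned <- trimws(gsub("\\s+", " ", removed))
print(cleaned)
```

The full URL pattern is listed first, so for a string such as "https://t.co/..." the whole URL is matched from its scheme onward instead of only the t.co part.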

Answer 2 (score: 3)

 str1 <- c("download file from http://example.com", "this is the link to my website https://www.google.com/ for more info")

 gsub('http\\S+\\s*',"", str1)
 #[1] "download file from "                         
 #[2] "this is the link to my website for more info"

 library(stringr)
 str_trim(gsub('http\\S+\\s*',"", str1)) #removes trailing/leading spaces
 #[1] "download file from"                          
 #[2] "this is the link to my website for more info"

Update

To match ftp as well, I would use the same idea from @Richard Scriven's post:

  str1 <- c("download file from http://example.com", "this is the link to my website https://www.google.com/ for more info",
  "this link to ftp://www.example.org/community/mail/view.php?f=db/6463 gives more info")


  gsub('(f|ht)tp\\S+\\s*',"", str1)
  #[1] "download file from "                         
  #[2] "this is the link to my website for more info"
  #[3] "this link to gives more info"     
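Since the question mentions stringr, the same removal can also be sketched with stringr's own helpers. This assumes a stringr version that provides `str_remove_all()` and `str_squish()`; the pattern below is a slightly stricter variant that requires "://", which is a choice made for this sketch and not part of the answer above.

```r
library(stringr)

str1 <- c(
    "download file from http://example.com",
    "this is the link to my website https://www.google.com/ for more info",
    "this link to ftp://www.example.org/community/mail/view.php?f=db/6463 gives more info"
)

# Assumed variant of the pattern above: the scheme must be followed by "://"
url_pattern <- "(f|ht)tps?://\\S+"

# str_remove_all() deletes every match; str_squish() trims the ends and
# collapses the doubled space left where a URL sat mid-sentence
cleaned <- str_squish(str_remove_all(str1, url_pattern))
print(cleaned)
```

Using `str_squish()` here avoids having to consume trailing whitespace inside the pattern itself.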

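Answer 3 (score: 2)

Some of the previous answers remove text beyond the end of the URL; adding "\\b" to the pattern helps with that. The pattern below also covers "sftp://" URLs.

For regular URLs: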

gsub("(s?)(f|ht)tp(s?)://\\S+\\b", "", x)

For tiny URLs:

gsub("[A-Za-z]{1,5}[.][A-Za-z]{2,3}/[A-Za-z0-9]+\\b", "", x)
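To see what the trailing `\\b` changes, the two variants can be compared side by side. This is a small sketch on made-up input strings; `perl = TRUE` is an assumption added here so that word boundaries follow PCRE rules.

```r
url_b     <- "(s?)(f|ht)tp(s?)://\\S+\\b"   # pattern from this answer
url_plain <- "(s?)(f|ht)tp(s?)://\\S+"      # same pattern without \\b

# Made-up test strings: a URL followed by a period, and one ending in a slash
x <- c("see http://example.com.", "site https://www.google.com/ now")

with_b    <- gsub(url_b, "", x, perl = TRUE)
without_b <- gsub(url_plain, "", x, perl = TRUE)

print(with_b)
print(without_b)
```

With `\\b` the match has to end on a word character, so the sentence-ending period is kept ("see ."). The same rule leaves a URL's trailing slash behind ("site / now"), which may or may not be what you want.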