Question

I have a character vector, myStrings, in R that looks like this:

[1] download file from `http://example.com`
[2] this is the link to my website `another url`
[3] go to `another url` from more info.

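Here another url stands for a valid http URL; Stack Overflow will not let me insert more than one URL, which is why I am writing another url instead. I want to remove all the URLs from myStrings so that it looks like this: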

[1] download file from
[2] this is the link to my website
[3] go to from more info.

I tried many functions in the stringr package, but nothing worked.

Answer 1

You can use gsub with a regular expression that matches URLs.

Set up the vector:

x <- c(
    "download file from http://example.com", 
    "this is the link to my website http://example.com", 
    "go to http://example.com from more info.",
    "Another url ftp://www.example.com",
    "And https://www.example.net"
)

Remove all the URLs from each string:

gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
# [1] "download file from"             "this is the link to my website"
# [3] "go to from more info."          "Another url"                   
# [5] "And"

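Update: It would help if you posted a few different URLs so we know what we are working with. Still, I think this regular expression will work for the URLs you mentioned in the comments: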

" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)"

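The expression above, explained:

" ?" an optional space
(f|ht) matches "f" or "ht"
tp matches "tp"
(s?) optionally matches "s"
(://) matches "://"
(.*) matches every character (everything) up to
[.|/] a period or a forward slash (inside the brackets the | is literal, so a pipe character matches too)
(.*) then everything after that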

I'm no expert with regular expressions, but I think I've explained that correctly.
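One consequence of the breakdown above is worth checking: because the pattern ends in (.*), it consumes the rest of the string, so any text after the URL is removed as well. The following is a minimal sketch of that behaviour; the test vector and the narrower `\\S+` pattern are assumptions added for illustration, not part of the answer.

```r
# Illustrative test vector; the third element has text after the URL.
x <- c(
    "download file from http://example.com",
    "Another url ftp://www.example.com",
    "go to http://example.com from more info."
)

# The pattern from the update, unchanged.
pat <- " ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)"
res_update <- gsub(pat, "", x)
# The trailing (.*) swallows " from more info." in the third string.
print(res_update)

# A narrower sketch (an assumption, not the answer's pattern):
# stop the match at the first whitespace character instead.
res_narrow <- gsub(" ?(f|ht)tps?://\\S+", "", x)
print(res_narrow)
```

If the text after the URL matters, a pattern that stops at whitespace is probably the safer choice.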

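Note: URL shorteners are no longer allowed in SO answers, so I was forced to remove a section when making my most recent edit. See the edit history for that section.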

Answer 2

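I've been working on a canned set of regular expressions for common tasks like this, and I've put them into a package, qdapRegex, on GitHub, which will eventually go to CRAN. It can extract the pieces as well as remove them. Feedback on the package from anyone taking a look is welcome.

Here it is:

library(devtools)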
install_github("trinker/qdapRegex")
library(qdapRegex)

x <- c("download file from http://example.com", 
         "this is the link to my website http://example.com", 
         "go to http://example.com from more info.",
         "Another url ftp://www.example.com",
         "And https://www.example.net",
         "twitter type: t.co/N1kq0F26tG",
         "still another one https://t.co/N1kq0F26tG :-)")

rm_url(x, pattern=pastex("@rm_twitter_url", "@rm_url"))

## [1] "download file from"             "this is the link to my website"
## [3] "go to from more info."          "Another url"                   
## [5] "And"                            "twitter type:"                 
## [7] "still another one :-)"         

rm_url(x, pattern=pastex("@rm_twitter_url", "@rm_url"), extract=TRUE)

## [[1]]
## [1] "http://example.com"
## 
## [[2]]
## [1] "http://example.com"
## 
## [[3]]
## [1] "http://example.com"
## 
## [[4]]
## [1] "ftp://www.example.com"
## 
## [[5]]
## [1] "https://www.example.net"
## 
## [[6]]
## [1] "t.co/N1kq0F26tG"
## 
## [[7]]
## [1] "https://t.co/N1kq0F26tG"

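Edit: I saw that the Twitter links were not being removed. I won't add this to the regex specific to the rm_url function, but I have added it to the dictionary in qdapRegex. So there is no dedicated function that removes both standard URLs and Twitter URLs, but pastex (paste regular expression) lets you easily grab regexes from the dictionary and paste them together (using the pipe operator, |). Since all rm_XXX style functions work essentially the same way, you can pass the pastex output to the pattern argument of any rm_XXX function, or create your own function as follows: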

rm_twitter_url <- rm_(pattern=pastex("@rm_twitter_url", "@rm_url"))
rm_twitter_url(x)
rm_twitter_url(x, extract=TRUE)
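For readers without the package, the idea of joining patterns with | can be sketched in base R. The two patterns below are simplified stand-ins written for this example; they are assumptions and are not the expressions stored in the qdapRegex dictionary.

```r
# Two simplified, illustrative patterns (not the qdapRegex dictionary entries).
url_pat     <- "(f|ht)tps?://\\S+"
twitter_pat <- "\\bt\\.co/[A-Za-z0-9]+"

# Wrap each pattern in a group and join them with the alternation operator.
combined <- paste0("(", c(url_pat, twitter_pat), ")", collapse = "|")

x <- c("And https://www.example.net",
       "twitter type: t.co/N1kq0F26tG",
       "still another one https://t.co/N1kq0F26tG :-)")

# Remove either kind of link together with the whitespace in front of it,
# then trim whatever is left at the ends.
cleaned <- trimws(gsub(paste0("\\s*(", combined, ")"), "", x, perl = TRUE))
print(cleaned)
```

Grouping each alternative before joining keeps the | from splitting a pattern in an unintended place.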

Answer 3

 str1 <- c("download file from http://example.com", "this is the link to my website https://www.google.com/ for more info")

 gsub('http\\S+\\s*',"", str1)
 #[1] "download file from "                         
 #[2] "this is the link to my website for more info"

 library(stringr)
 str_trim(gsub('http\\S+\\s*',"", str1)) #removes trailing/leading spaces
 #[1] "download file from"                          
 #[2] "this is the link to my website for more info"

Update

To match ftp as well, I would use the same idea as in @Richard Scriven's post:

  str1 <- c("download file from http://example.com", "this is the link to my website https://www.google.com/ for more info",
  "this link to ftp://www.example.org/community/mail/view.php?f=db/6463 gives more info")


  gsub('(f|ht)tp\\S+\\s*',"", str1)
  #[1] "download file from "                         
  #[2] "this is the link to my website for more info"
  #[3] "this link to gives more info"
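If stringr is not available, the same clean-up can be done in base R. The helper below is a hypothetical sketch (the name remove_urls is made up for this example); it assumes that leftover runs of whitespace should be collapsed to a single space.

```r
# Hypothetical helper, not part of any package.
remove_urls <- function(s) {
    s <- gsub("(f|ht)tps?://\\S+", "", s, perl = TRUE)  # drop the URL itself
    s <- gsub("\\s+", " ", s, perl = TRUE)               # collapse leftover whitespace runs
    trimws(s)                                            # strip leading/trailing spaces
}

str1 <- c("download file from http://example.com",
          "this is the link to my website https://www.google.com/ for more info")

out <- remove_urls(str1)
print(out)
```

Collapsing whitespace in a separate step means the URL pattern itself does not need to decide which neighbouring spaces to consume.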

Answer 4

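Some of the previous answers remove text beyond the end of the URL; ending the pattern with "\b" (a word boundary) helps with that. The pattern can also cover "sftp://" URLs.

For regular URLs: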

gsub("(s?)(f|ht)tp(s?)://\\S+\\b", "", x)

For tiny URLs:

gsub("[A-Za-z]{1,5}[.][A-Za-z]{2,3}/[A-Za-z0-9]+\\b", "", x)
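To see what the word boundary changes, compare the regular-URL pattern with and without it on strings where punctuation directly follows the URL. This is a small sketch; the test strings are made up, and perl = TRUE is an assumption added here so that \b follows PCRE semantics.

```r
# Made-up strings where punctuation touches the end of the URL.
x <- c("see http://example.com.",
       "get it via sftp://host.example.org!")

# Without the boundary, \\S+ also swallows the trailing punctuation.
no_boundary <- gsub("(s?)(f|ht)tp(s?)://\\S+", "", x, perl = TRUE)
print(no_boundary)

# With the boundary, the match has to end right after a word character,
# so the trailing "." and "!" stay in the string.
with_boundary <- gsub("(s?)(f|ht)tp(s?)://\\S+\\b", "", x, perl = TRUE)
print(with_boundary)
```

Note that the boundary version leaves the space in front of the URL in place, so a trimming or whitespace-collapsing step may still be wanted afterwards.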
