Question

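I have some text: "At https://www.google.com/ you can google questions!", and I want to remove the URL from it with str_remove_all() (from stringr) as part of a tidytext workflow.

How do I do that?
If my vector contains more texts like this, how do I remove the URLs from every element of the vector?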

Answer 1

Use gsub from base R with a regular expression. It makes your life easier.

text <- "At https://www.google.com/ you can google questions!"

# Drop everything from "http" up to the next whitespace, plus any spaces that follow
gsub('http\\S+\\s*', '', text)

[1]  "At you can google questions!"
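The question also asks about a vector with several texts. As a minimal sketch, the same gsub call can be applied to a whole character vector, because gsub is vectorized over its input. The vector `texts` below is a made-up example, not data from the question.

```r
# Hypothetical input vector; the third element has no URL on purpose
texts <- c(
  "At https://www.google.com/ you can google questions!",
  "Come to https://www.stackoverflow.com/ for R answers",
  "No link in this one"
)

# gsub() processes every element of the vector, so no loop is needed
cleaned <- gsub("http\\S+\\s*", "", texts)
print(cleaned)
```

Note that `\\S+` runs up to the next whitespace, so punctuation glued to the end of a URL (for example a trailing "?") is removed along with it.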

Answer 2

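I would suggest using a more elaborate URL regex than the other answer, so that it is more robust across different kinds of URLs.

If you are used to using tidyverse tools elsewhere in your workflow, then the str_remove_all() function from stringr is a good choice. This function is vectorized, so you can pass it a vector of texts.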

example <- c("At https://www.google.com/ you can google questions!",
             "Come to https://www.stackoverflow.com/ for R answers",
             "How many repos are there at https://www.stackoverflow.com/?")

library(stringr)
# Match http:// or https://, then one or more URL characters or %XX escapes
url_regex <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

str_remove_all(example, url_regex)
#> [1] "At  you can google questions!" "Come to  for R answers"       
#> [3] "How many repos are there at "
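The output above still has a double space where each URL used to be, and a trailing space in the third element. If that matters, one possible follow-up (a sketch, not part of the original answer) is to wrap the call in stringr's str_squish(), which collapses repeated whitespace and trims both ends.

```r
library(stringr)

example <- c("At https://www.google.com/ you can google questions!",
             "Come to https://www.stackoverflow.com/ for R answers",
             "How many repos are there at https://www.stackoverflow.com/?")

url_regex <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

# Remove the URLs first, then tidy up the leftover whitespace
cleaned <- str_squish(str_remove_all(example, url_regex))
print(cleaned)
```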

If your text is in a data frame, you can also use str_remove_all() directly:

library(dplyr)

tibble(example) %>%
    mutate(cleaned = str_remove_all(example, url_regex))

#> # A tibble: 3 x 2
#>   example                                          cleaned                 
#>   <chr>                                            <chr>                   
#> 1 At https://www.google.com/ you can google quest… At  you can google ques…
#> 2 Come to https://www.stackoverflow.com/ for R an… Come to  for R answers  
#> 3 How many repos are there at https://www.stackov… "How many repos are the…

Created on 2019-07-10 by the reprex package (v0.3.0)
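Since the question mentions tidytext, here is a rough sketch of how the cleaned column might feed into tokenization. It assumes tidytext is installed and that one row per word is the desired shape. The column name `word` and the object name `tokens` are illustrative choices.

```r
library(dplyr)
library(stringr)
library(tidytext)

example <- c("At https://www.google.com/ you can google questions!",
             "Come to https://www.stackoverflow.com/ for R answers",
             "How many repos are there at https://www.stackoverflow.com/?")

url_regex <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

tokens <- tibble(example) %>%
    # strip the URLs before tokenizing so they never become tokens
    mutate(cleaned = str_remove_all(example, url_regex)) %>%
    # one row per word; unnest_tokens() lowercases by default
    unnest_tokens(word, cleaned)

print(tokens)
```

Removing the URLs before calling unnest_tokens() keeps URL fragments out of the token column.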

How to remove URLs from a text vector using the stringr package?
