I have a piece of text: "At https://www.google.com/ you can google questions!", and I want to remove the URL from it using str_remove_all in the tidytext package.
Answer 0: (score: 1)
Use gsub from base R with a regular expression. It makes your life easier.
text <- "At https://www.google.com/ you can google questions!"
gsub('http\\S+\\s*', '', text)
[1] "At you can google questions!"
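To make the pattern easier to adapt, here is a minimal sketch that spells out each piece of it and applies it to a whole character vector (gsub is vectorized). The input strings and the variable names texts and no_urls are made up for illustration.

```r
# A few made-up inputs, including one without a URL and one that ends in a URL.
texts <- c("At https://www.google.com/ you can google questions!",
           "No link in this one.",
           "Docs are at https://www.google.com/")

# http   literal prefix (also matches https)
# \\S+   the rest of the URL: everything up to the next whitespace
# \\s*   any whitespace after the URL, so no double space is left behind
no_urls <- gsub("http\\S+\\s*", "", texts)

# trimws() drops the space left over when the URL was the last token.
no_urls <- trimws(no_urls)
no_urls
#> [1] "At you can google questions!" "No link in this one."
#> [3] "Docs are at"
```

Because \\S+ runs to the next whitespace, punctuation attached directly to a URL (a closing parenthesis or a full stop, say) is removed along with it.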
Answer 1: (score: 0)
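I would suggest using a more sophisticated URL regex than the one in the other answer, so that it is more robust to different kinds of URLs.
If you are used to tidyverse tools elsewhere in your workflow, the str_remove_all() function from stringr is a good choice. It is vectorized, so you can pass it a character vector of texts.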
example <- c("At https://www.google.com/ you can google questions!",
             "Come to https://www.stackoverflow.com/ for R answers",
             "How many repos are there at https://www.stackoverflow.com/?")
library(stringr)
url_regex <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
str_remove_all(example, url_regex)
#> [1] "At  you can google questions!" "Come to  for R answers"
#> [3] "How many repos are there at "
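Because url_regex matches only the URL itself, the spaces on either side of a removed URL stay in the string, and a URL at the end leaves a trailing space. If that matters, one option is to pass the result through str_squish() from stringr, which trims both ends and collapses runs of whitespace. This is a minimal sketch; the shortened input vector and the name cleaned are chosen for illustration.

```r
library(stringr)

# Same pattern as in the answer above, repeated so this block runs on its own.
url_regex <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

example <- c("At https://www.google.com/ you can google questions!",
             "How many repos are there at https://www.stackoverflow.com/?")

# Remove the URLs first, then normalise the whitespace that is left behind.
cleaned <- str_squish(str_remove_all(example, url_regex))
cleaned
#> [1] "At you can google questions!" "How many repos are there at"
```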
If your text is in a data frame, you can also use str_remove_all() directly:
library(dplyr)
tibble(example) %>%
  mutate(cleaned = str_remove_all(example, url_regex))
#> # A tibble: 3 x 2
#>   example                                          cleaned
#>   <chr>                                            <chr>
#> 1 At https://www.google.com/ you can google quest… At  you can google ques…
#> 2 Come to https://www.stackoverflow.com/ for R an… Come to  for R answers
#> 3 How many repos are there at https://www.stackov… "How many repos are the…
Created on 2019-07-10 by the reprex package (v0.3.0)
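Since the question mentions tidytext, here is a rough sketch of how the cleaning step might sit in front of tokenization. It assumes the tidytext package is installed and that the goal is one word per row. The data frame docs and its columns id and text are hypothetical, and the simple pattern from the gsub answer is used for brevity.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical one-row data frame; the column names id and text are made up.
docs <- tibble(id = 1L,
               text = "At https://www.google.com/ you can google questions!")

tokens <- docs %>%
  # Strip URLs before tokenizing so they are not split into word tokens.
  mutate(text = str_remove_all(text, "http\\S+\\s*")) %>%
  # One lowercased word per row; the default tokenizer drops punctuation.
  unnest_tokens(word, text)

tokens$word
```

Removing the URLs inside mutate() keeps the whole pipeline in one place, and the stricter url_regex from this answer can be swapped in for the simple pattern.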