How do I extract a specific substring from a URL using R?

Time: 2018-08-13 10:45:25

Tags: r

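How can I modify my code so that I get "test-maps" rather than just "test"?

I want to capture all characters between "https://" and ".google.com" in the following two URLs:

https://test-maps.google.com
https://ulla.google.com

So I want to extract only "test-maps" and "ulla" using the same piece of code. I have tried the following R code: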

url <- c("https://ulla.google.com", "https://test-maps.google.com") 
pat = "(https://*?)(\\w+)(.*)"
gsub(pat, "\\2", url)

Actual output

"ulla" "test"

Expected output

"ulla" "test-maps"

4 answers:

Answer 0: (score: 2)

You can use the urltools package:

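host_extract extracts the domain and host. Since we only need the host, host_extract(url)$host returns just the host values. Use scheme to get the URL scheme (http or https), paste it together with "://", and strip that prefix with gsub to get what you need. The call can be applied over the URL vector with sapply or lapply.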
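A minimal sketch of this approach, using the expression that appears as r7 in the benchmark under Answer 1. It assumes the urltools package is installed; the `requireNamespace` guard and the variable name `hosts` are additions for illustration, not part of the original answer.

```r
url <- c("https://ulla.google.com", "https://test-maps.google.com")

# Run only if urltools is available (assumption: it may not be installed)
if (requireNamespace("urltools", quietly = TRUE)) {
  hosts <- sapply(
    url,
    # Strip "<scheme>://" from the host value returned by host_extract()
    function(x) gsub(paste0(urltools::scheme(x), "://"), "",
                     urltools::host_extract(x)$host),
    USE.NAMES = FALSE
  )
  print(hosts)
}
```

Swapping `sapply` for `lapply` would return a list instead of a character vector.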

Answer 1: (score: 1)

Here are a few other options:

url <- c("https://ulla.google.com", "https://test-maps.google.com") 

gsub("^.*?//(.*)?\\.google.*?$", "\\1", url)
#> [1] "ulla"      "test-maps"

unlist(regmatches(url, gregexpr("^.*?//\\K(\\w|-)+", url, perl=TRUE)))
#> [1] "ulla"      "test-maps"

library(stringr)
str_extract(url, "(?<=//).*?(?=\\.)")
#> [1] "ulla"      "test-maps"

str_extract(url, "(\\w|-)+(?=\\.)")
#> [1] "ulla"      "test-maps"
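The pattern in the question stops at "test" because `\\w` matches only letters, digits and the underscore, not the hyphen. A minimal base R sketch of that difference follows; the variable names `short` and `full` are illustrative only.

```r
url <- c("https://ulla.google.com", "https://test-maps.google.com")

# \\w+ alone stops at the hyphen, so the captured group is cut short
short <- sub("^https://(\\w+).*$", "\\1", url)

# Adding the hyphen to a bracket expression lets the group run up to the first dot
full <- sub("^https://([[:alnum:]_-]+)\\..*$", "\\1", url)

print(short)
print(full)
```

This is why the options above either allow the hyphen explicitly, as in `(\\w|-)+`, or match any character up to the first dot.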

If we benchmark all the solutions listed here and in the other answers:

microbenchmark::microbenchmark(
  r1 = gsub("^.*?//(.*)?\\.google.*?$", "\\1", url),
  r2 = unlist(regmatches(url, gregexpr("^.*?//\\K(\\w|-)+", url, perl=TRUE))),
  r3 = str_extract(url, "(?<=//).*?(?=\\.)"),
  r4 = str_extract(url, "(\\w|-)+(?=\\.)"),
  r5 = url %>% str_replace("\\w+\\:\\//", "") %>% str_replace("\\.\\w+\\.\\w+", ""),
  r6 = url %>% gsub("\\..*","",.) %>% gsub("(https://*?)(\\w+)(*)", "\\2", .),
  r7 = sapply(url, function(x) gsub(paste0(scheme(x), "://"), "", host_extract(x)$host), USE.NAMES = FALSE),
  times = 1000
) 
#> Unit: microseconds
#>  expr     min       lq      mean   median       uq       max neval
#>    r1  25.188  36.2695  42.09713  40.5385  44.6705   121.243  1000
#>    r2  63.554  93.7230 116.28898 101.6285 116.1940  3407.797  1000
#>    r3  20.644  32.5505  41.63846  39.0320  45.1230   183.720  1000
#>    r4  32.574  45.7445  57.49725  53.5265  60.0635   662.852  1000
#>    r5 305.978 356.8885 422.22098 379.7260 428.6380  4387.231  1000
#>    r6 160.318 198.6030 251.32088 216.3115 241.3045  6136.862  1000
#>    r7 553.548 612.4135 745.39361 638.5895 720.6745 25381.766  1000

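It looks like the fastest are gsub("^.*?//(.*)?\\.google.*?$", "\\1", url) and str_extract(url, "(?<=//).*?(?=\\.)"), i.e. r1 and r3, with median timings of roughly 40 and 39 microseconds.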
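When the prefix and suffix are fixed, as in the question, the same extraction can also be sketched without regular expressions. This is not one of the benchmarked solutions. The helper name `between` and its arguments are hypothetical, and it assumes R 3.3 or later for `startsWith` and `endsWith`.

```r
# Hypothetical helper: return the text between a fixed prefix and a fixed suffix,
# or NA where the input does not have both
between <- function(x, prefix = "https://", suffix = ".google.com") {
  ok <- startsWith(x, prefix) & endsWith(x, suffix)
  out <- rep(NA_character_, length(x))
  out[ok] <- substr(x[ok], nchar(prefix) + 1, nchar(x[ok]) - nchar(suffix))
  out
}

url <- c("https://ulla.google.com", "https://test-maps.google.com")
result <- between(url)
print(result)
```

Returning NA for non-matching inputs is a design choice of this sketch; the gsub-based solutions return the input unchanged when the pattern does not match.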

Answer 2: (score: 0)

Just use the following:

url <- c("https://ulla.google.com", "https://test-maps.google.com")
url
#> [1] "https://ulla.google.com"      "https://test-maps.google.com"
url <- gsub("\\..*", "", url)  # Keep everything before the first dot (.)
url
#> [1] "https://ulla"      "https://test-maps"
pat <- "(https://*?)(\\w+)(*)" # Keep everything after //
gsub(pat, "\\2", url)
#> [1] "ulla"      "test-maps"

Update (added a piped solution)

library(stringr)
url %>%
gsub("\\..*","",.) %>%
gsub("(https://*?)(\\w+)(*)", "\\2", .)

Answer 3: (score: 0)

library(stringr)
url <- c("https://ulla.google.com", "https://test-maps.google.com")
# remove front bit
remove_https <- str_replace(url, "\\w+\\:\\//", "")
#remove back bit
just_host <- str_replace(remove_https, "\\.\\w+\\.\\w+", "")
just_host

Or, written as a pipeline:

just_host <- url %>%
str_replace("\\w+\\:\\//", "") %>%
str_replace("\\.\\w+\\.\\w+", "")
just_host