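How can I change my code to get "test-maps" instead of just "test"?
I want to capture all the characters between "https://" and ".google.com" in the following two URLs:
https://test-maps.google.com and https://ulla.google.com
So I want to extract only "test-maps" and "ulla" with the same piece of code. So far I have the following R code: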
url <- c("https://ulla.google.com", "https://test-maps.google.com")
pat = "(https://*?)(\\w+)(.*)"
gsub(pat, "\\2", url)
Actual output
"ulla" "test"
Expected output
"ulla" "test-maps"
Answer 0: (score: 2)
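You can use the urltools package.
host_extract extracts the domain and the host. Since we only need the host, host_extract(url)$host returns just the host values.
Use scheme to get the URL scheme (http or https), paste it together with :// so that this prefix can be removed from the host value, and use sapply or lapply to get what you need.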
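A minimal sketch of the steps described above, assuming the urltools package is installed and that host_extract() leaves the scheme prefix on the host value (as the r7 entry in the benchmark further down suggests). strip_scheme is a hypothetical helper introduced only for this illustration; it is not part of urltools.

```r
url <- c("https://ulla.google.com", "https://test-maps.google.com")

# Hypothetical helper (not part of urltools): drop a leading "<scheme>://"
# from each host string if it is present, otherwise return it unchanged.
strip_scheme <- function(host, scheme) {
  prefix <- paste0(scheme, "://")
  ifelse(startsWith(host, prefix),
         substring(host, nchar(prefix) + 1),
         host)
}

# Only run the urltools part when the package is available.
if (requireNamespace("urltools", quietly = TRUE)) {
  hosts <- urltools::host_extract(url)$host   # host column of the result
  schemes <- urltools::scheme(url)            # scheme of each URL
  print(strip_scheme(hosts, schemes))
}
```

Because the helper only removes the prefix when it is actually there, it should behave the same whether or not the host value still carries the scheme.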
Answer 1: (score: 1)
Here are some other options:
url <- c("https://ulla.google.com", "https://test-maps.google.com")
gsub("^.*?//(.*)?\\.google.*?$", "\\1", url)
#> [1] "ulla" "test-maps"
unlist(regmatches(url, gregexpr("^.*?//\\K(\\w|-)+", url, perl=TRUE)))
#> [1] "ulla" "test-maps"
library(stringr)
str_extract(url, "(?<=//).*?(?=\\.)")
#> [1] "ulla" "test-maps"
str_extract(url, "(\\w|-)+(?=\\.)")
#> [1] "ulla" "test-maps"
If we look at a benchmark of all the solutions listed here and by others:
microbenchmark::microbenchmark(
r1 = gsub("^.*?//(.*)?\\.google.*?$", "\\1", url),
r2 = unlist(regmatches(url, gregexpr("^.*?//\\K(\\w|-)+", url, perl=TRUE))),
r3 = str_extract(url, "(?<=//).*?(?=\\.)"),
r4 = str_extract(url, "(\\w|-)+(?=\\.)"),
r5 = url %>% str_replace("\\w+\\:\\//", "") %>% str_replace("\\.\\w+\\.\\w+", ""),
r6 = url %>% gsub("\\..*","",.) %>% gsub("(https://*?)(\\w+)(*)", "\\2", .),
r7 = sapply(url, function(x) gsub(paste0(scheme(x), "://"), "", host_extract(x)$host), USE.NAMES = FALSE),
times = 1000
)
#> Unit: microseconds
#>  expr     min       lq      mean   median       uq       max neval
#>    r1  25.188  36.2695  42.09713  40.5385  44.6705   121.243  1000
#>    r2  63.554  93.7230 116.28898 101.6285 116.1940  3407.797  1000
#>    r3  20.644  32.5505  41.63846  39.0320  45.1230   183.720  1000
#>    r4  32.574  45.7445  57.49725  53.5265  60.0635   662.852  1000
#>    r5 305.978 356.8885 422.22098 379.7260 428.6380  4387.231  1000
#>    r6 160.318 198.6030 251.32088 216.3115 241.3045  6136.862  1000
#>    r7 553.548 612.4135 745.39361 638.5895 720.6745 25381.766  1000
It looks like the fastest are gsub("^.*?//(.*)?\\.google.*?$", "\\1", url) and str_extract(url, "(?<=//).*?(?=\\.)").
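One practical difference between these two, which the benchmark does not show, is what happens when a string does not match: gsub() returns the input unchanged, while str_extract() returns NA. The sketch below illustrates this in base R only. The second input string and the names res_gsub and res_extract are made up for the example, and regexpr() with perl = TRUE stands in for str_extract() here.

```r
url <- c("https://test-maps.google.com", "not-a-url")

# gsub(): elements without a match are returned as they are.
res_gsub <- gsub("^.*?//(.*)?\\.google.*?$", "\\1", url)

# Extraction: start from NA and fill in only the elements that matched.
m <- regexpr("(?<=//).*?(?=\\.)", url, perl = TRUE)
res_extract <- rep(NA_character_, length(url))
res_extract[m > 0] <- regmatches(url, m)

print(res_gsub)     # second element is still "not-a-url"
print(res_extract)  # second element is NA
```

Which behavior is preferable depends on whether unmatched inputs should be passed through or flagged as missing.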
Answer 2: (score: 0)
Just use the following:
url <- c("https://ulla.google.com", "https://test-maps.google.com")
url
[1] "https://ulla.google.com" "https://test-maps.google.com"
url=gsub("\\..*","",url) # Extract everything before first dot (.)
url
[1] "https://ulla" "https://test-maps"
pat = "(https://*?)(\\w+)(*)" # Extract everything after //
gsub(pat, "\\2", url)
[1] "ulla" "test-maps"
Update (added a pipe solution):
library(stringr)
url %>%
  gsub("\\..*","",.) %>%
  gsub("(https://*?)(\\w+)(*)", "\\2", .)
Answer 3: (score: 0)
library(stringr)
url <- c("https://ulla.google.com", "https://test-maps.google.com")
# remove front bit
remove_https <- str_replace(url, "\\w+\\:\\//", "")
#remove back bit
just_host <- str_replace(remove_https, "\\.\\w+\\.\\w+", "")
just_host
Or as a pipeline:
just_host <- url %>%
  str_replace("\\w+\\:\\//", "") %>%
  str_replace("\\.\\w+\\.\\w+", "")
just_host
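Note that the second pattern removes the first ".label.label" run it finds, so it relies on the host having exactly one label before ".google.com". Below is a small base R sketch of that assumption, using sub() in place of str_replace() (both replace only the first match) and a made-up host with two subdomain labels. Anchoring the pattern with $ is one possible adjustment, not part of the original answer.

```r
# Hosts with the scheme already removed; the second value is hypothetical.
host <- c("test-maps.google.com", "a.b.google.com")

# Unanchored: the first ".xxx.yyy" run is removed wherever it occurs.
unanchored <- sub("\\.\\w+\\.\\w+", "", host)

# Anchored at the end: only the last two labels are removed.
anchored <- sub("\\.\\w+\\.\\w+$", "", host)

print(unanchored)  # c("test-maps", "a.com")
print(anchored)    # c("test-maps", "a.b")
```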