Function to extract a domain name from a URL in R

Time: 2013-09-26 06:19:11

Tags: r

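I am looking for a function that can extract the domain name from a URL in R. Is there any function in R similar to Python's tldextract? Edit: At the moment I am using the following approach: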

domain=substr(as.character("www.google.com"), 
   which(strsplit("www.google.com",'')[[1]]=='.')[1]+1, nchar("www.google.com"))

But I am looking for a predefined function that would save the coding effort.

4 answers:

Answer 0 (score: 18)

You can also use the relatively new urltools package:

library(urltools)

URLs <- c("http://stackoverflow.com/questions/19020749/function-to-extract-domain-name-from-url-in-r",
          "http://www.talkstats.com/", "www.google.com")

suffix_extract(domain(URLs))

##                host subdomain        domain suffix
## 1 stackoverflow.com      <NA> stackoverflow    com
## 2 www.talkstats.com       www     talkstats    com
## 3    www.google.com       www        google    com

It is backed by Rcpp, so it is fast (considerably faster than using the built-in R apply functions).

Answer 1 (score: 3)

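I don't know of a function in a package that does this, and I don't think there is anything in the base install of R. Use a user-defined function and store it somewhere to source later, or make your own package with it.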

x1 <- "http://stackoverflow.com/questions/19020749/function-to-extract-domain-name-from-url-in-r"
x2 <- "http://www.talkstats.com/"
x3 <- "www.google.com"

domain <- function(x) strsplit(gsub("http://|https://|www\\.", "", x), "/")[[c(1, 1)]]

domain(x3)
sapply(list(x1, x2, x3), domain)
## [1] "stackoverflow.com" "talkstats.com"     "google.com"
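
The function above handles one URL at a time, which is why it is wrapped in sapply. As a minimal sketch (not part of the original answer), the same idea can be written with vectorized sub() calls. The name domain_vec is made up for illustration, and anchoring the patterns at the start of the string is an added assumption.

```r
# Hypothetical vectorized variant of the answer's domain() helper.
domain_vec <- function(x) {
  x <- sub("^https?://", "", x)  # drop a leading http:// or https://
  x <- sub("^www\\.", "", x)     # drop a leading "www."
  sub("/.*$", "", x)             # drop the path, if any
}

urls <- c(
  "http://stackoverflow.com/questions/19020749/function-to-extract-domain-name-from-url-in-r",
  "http://www.talkstats.com/",
  "www.google.com"
)
domain_vec(urls)
## [1] "stackoverflow.com" "talkstats.com"     "google.com"
```

Because sub() works on whole character vectors, no sapply call is needed here.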

Answer 2 (score: 1)

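I just wrote this regular expression, which can be applied to both e-mail addresses and websites to extract and match on the domain. The regex can be modified to extract different parts, and it is vectorized. I do some additional processing, which is entirely optional.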

invalid_domains = "yahoo.com|aol.com|gmail.com|hotmail.com|comcast.net|me.com|att.net|verizon.net|live.com|sbcglobal.net|msn.com|outlook.com|ibm.com"
domain <- function(x){
  to_return <- tolower(as.character(x))
  to_return <- gsub('.*[.@/]+([^.@:/]+[.][a-z]+)(/.*$|$)', '\\1', to_return, ignore.case = TRUE) # extract domain; \\1 keeps just the first group
  to_return <- gsub("[.]ocm$", ".com", to_return) # correct misspellings such as ".ocm"
  # to_return <- ifelse(grepl(invalid_domains,to_return),NA,to_return) # (optional) check for invalid domains, especially when working with emails.
  to_return <- ifelse(grepl('[.]',to_return),to_return,NA) # if there is no . this is an invalid domain, return NA
  return(to_return)
}
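
As a rough usage sketch, the same pattern can be tried on a few made-up inputs. The helper name extract_domain and the sample values are assumptions for illustration; the regex is the one from the function above.

```r
# Hypothetical condensed helper built around the answer's regex.
extract_domain <- function(x) {
  out <- tolower(as.character(x))
  out <- gsub(".*[.@/]+([^.@:/]+[.][a-z]+)(/.*$|$)", "\\1", out)
  ifelse(grepl("[.]", out), out, NA)  # no dot means no usable domain
}

samples <- c("User@Example.com", "http://www.example.com/path", "localhost")
extract_domain(samples)
## [1] "example.com" "example.com" NA
```

Note that a path ending in a dotted file name, such as /page.html, can be picked up by this pattern instead of the host, so it may be worth stripping the path first.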

Answer 3 (score: 0)

A vectorized option that uses only base R, and whose output is easy to customize, could be:

url_regexpr <- function() {
  protocol <- "([^/]+://)*"  # optional
  sub <- "([^\\.\\?/]+\\.)*"  # optional
  domain <- "([^\\.\\?/]+)"  # required
  dot <- "(\\.)"  # required
  suffix <- "([^/]+)"  # required
  folders <- "(/[^\\?]*)*"  # optional
  args <- "(\\?.*)*"  # optional

  paste0(
    "^",
    protocol, sub, domain, dot, suffix, folders, args,
    "$"
  )
}

get_domain <- function(url, include_suffix = TRUE) {
  res <- paste0("\\3", c("\\4\\5")[include_suffix])
  sub(url_regexpr(), res, url)
}

I have tested it with the following:

library(testthat)
test_that("get_domain works", {
  expect_equal(get_domain("https://www.example.com"), "example.com")
  expect_equal(get_domain("http://www.example.com"), "example.com")
  expect_equal(get_domain("www.example.com"), "example.com")

  expect_equal(get_domain("www.example.net"), "example.net")
  expect_equal(get_domain("www.example.net/baz"), "example.net")

  expect_equal(get_domain("https://www.example.net/baz"), "example.net")
  expect_equal(get_domain("https://www.example.net/baz/tar"), "example.net")

  expect_equal(get_domain("https://foo.example.net"), "example.net")
  expect_equal(get_domain("https://www.foo.example.net"), "example.net")
})

test_that("get_domain is vectorized", {
  urls <- c("www.example.com", "www.example.net")
  expect_equal(get_domain(urls), c("example.com", "example.net"))
})


test_that("can remove suffix", {
  expect_equal(
    get_domain("https://www.example.com", include_suffix = FALSE),
    "example"
  )
})

test_that("works with file extensions", {
  expect_equal(
    get_domain("https://www.example.com/foo.php"),
    "example.com"
  )
})

test_that("works against trailing slash", {
  expect_equal(
    get_domain("http://m.example.com/"),
    "example.com"
  )
})

test_that("works against args after slash", {
  expect_equal(
    get_domain("http://example.com/?"),
    "example.com"
  )
})

test_that("works against multiple dots after slash", {
  expect_equal(
    get_domain("http://example.com/foo.net.bar"),
    "example.com"
  )
})

test_that("generalized protocols", {
  expect_equal(
    get_domain("android-app://example.com/foo.net.bar"),
    "example.com"
  )
})
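
One case the tests above do not cover is a multi-part suffix such as co.uk, where a pattern that relies only on dots cannot tell which labels belong to the suffix. Below is a minimal base R sketch of one way to handle it, assuming a small hand-picked suffix list. The names known_suffixes and registered_domain are hypothetical, and the list is illustrative only; it is not the Public Suffix List that tools like tldextract rely on.

```r
# Illustrative suffix list only; a real one would be far longer.
known_suffixes <- c("co.uk", "com.au", "com", "net", "org")

registered_domain <- function(host, suffixes = known_suffixes) {
  # Check longer suffixes first so "co.uk" is tried before shorter ones.
  suffixes <- suffixes[order(nchar(suffixes), decreasing = TRUE)]
  one_host <- function(h) {
    for (s in suffixes) {
      # One label, a dot, then the literal suffix at the end of the host.
      pat <- paste0("[^.]+[.]", gsub(".", "[.]", s, fixed = TRUE), "$")
      hit <- regmatches(h, regexpr(pat, h))
      if (length(hit) == 1) return(hit)
    }
    NA_character_  # no known suffix matched
  }
  vapply(host, one_host, character(1), USE.NAMES = FALSE)
}

registered_domain(c("www.example.co.uk", "foo.example.net", "localhost"))
## [1] "example.co.uk" "example.net"   NA
```

The function expects bare host names, so any protocol and path would need to be stripped beforehand.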