Regex for domain name

Date: 2014-10-22 23:31:28

Tags: xml regex r xpath

I am trying to extract the domain name from a URL. For example, from:

x <- "https://stackoverflow.com/questions/ask"

to: stackoverflow.com

I found the following regex in this question: "regex match main domain name".

regex <- "([0-9A-Za-z]{2,}\\.[0-9A-Za-z]{2,3}\\.[0-9A-Za-z]{2,3}|[0-9A-Za-z]{2,}\\.[0-9A-Za-z]{2,3})$"

But when I try to use it with str_extract from the stringr package, R doesn't seem to understand it.

x2 <- str_extract(x, regex)

3 Answers:

Answer 0: (score: 4)

Why not use parseURI from the XML package? It breaks a URL down into its separate components.

x <- "http://stackoverflow.com/questions/ask"
library(XML)
parseURI(x)$server
# [1] "stackoverflow.com"
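
If you would rather stay with a regex, here is a minimal base-R sketch. The pattern in the question ends in `$`, so it can only match at the end of the string, while the URL ends in a path; the sketch therefore isolates the host first. The helper name `get_host` and its patterns are made up for illustration and do not cover every URL form (IPv6 literals, for instance).

```r
# Hypothetical helper (not from XML or stringr): reduce a URL to its host.
get_host <- function(url) {
  no_scheme   <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", url)  # drop "http://", "https://", ...
  no_userinfo <- sub("^[^/?#]*@", "", no_scheme)               # drop "user:password@" if present
  sub("[/?#:].*$", "", no_userinfo)                            # drop port, path, query, fragment
}

# Pattern from the question, with the dots escaped as "\\."
pattern <- "([0-9A-Za-z]{2,}\\.[0-9A-Za-z]{2,3}\\.[0-9A-Za-z]{2,3}|[0-9A-Za-z]{2,}\\.[0-9A-Za-z]{2,3})$"

x <- "https://stackoverflow.com/questions/ask"
host <- get_host(x)

# With only the host left, the "$" anchor lines up with the end of the domain.
domain <- regmatches(host, regexpr(pattern, host))
domain
# [1] "stackoverflow.com"
```

The pattern itself is still a heuristic: for a host such as www.bbc.com it returns the whole host, because three short labels satisfy its first alternative.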

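Answer 1: (score: 3)

Extracting TLDs isn't as straightforward as you might think. There is a nice list of what are considered "public TLDs", i.e. which suffixes are valid, real top-level domains. I work with these daily (data mining in the cybersecurity field).

We have a tldextract R package that parses these nicely for further data mining. You can use parse_url from httr to extract the hostname component and then run our tldextract function on it: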

library(httr)
library(rvest)
library(tldextract)

# get some URLs - I encourage you to bump up "10" to "100" or more to see how
# tldextract deals with "public TLDs"
# read_html() replaces the deprecated html() in current rvest releases
pg <- read_html("http://httparchive.org/urls.php?start=1&end=10")

# clean up the <pre> output and make it a character list
urls <- pg %>% html_nodes("pre") %>% html_text() %>% strsplit("\n") %>% unlist
urls <- urls[urls != ""] # that site has a blank first line we don't need

# extract the hostname part
urls <- as.character(unlist(sapply(lapply(urls, parse_url), "[", "hostname")))
urls

##  [1] "www.google.com"    "www.facebook.com"  "www.youtube.com"  
##  [4] "www.yahoo.com"     "www.baidu.com"     "www.wikipedia.org"
##  [7] "www.amazon.com"    "www.twitter.com"   "www.qq.com"       
## [10] "www.taobao.com"

# extract the TLDs
tlds <- tldextract(urls)
tlds

##                 host subdomain    domain tld
## 1     www.google.com       www    google com
## 2   www.facebook.com       www  facebook com
## 3    www.youtube.com       www   youtube com
## 4      www.yahoo.com       www     yahoo com
## 5      www.baidu.com       www     baidu com
## 6  www.wikipedia.org       www wikipedia org
## 7     www.amazon.com       www    amazon com
## 8    www.twitter.com       www   twitter com
## 9         www.qq.com       www        qq com
## 10    www.taobao.com       www    taobao com

# piece what we need together
sprintf("%s.%s", tlds$domain, tlds$tld)

##  [1] "google.com"    "facebook.com"  "youtube.com"   "yahoo.com"    
##  [5] "baidu.com"     "wikipedia.org" "amazon.com"    "twitter.com"  
##  [9] "qq.com"        "taobao.com"
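
To see why a suffix list matters, here is a toy base-R sketch of the idea: find the longest known public suffix at the end of the host and keep one more label to its left. The four-entry `suffixes` vector and the function `registered_domain` are illustrative assumptions; this is not how tldextract is implemented, and the real list is far larger.

```r
# Toy suffix list, for illustration only.
suffixes <- c("com", "org", "co.uk", "com.cn")

# Hypothetical helper: return "<label>.<public suffix>" for one host,
# or NA if no suffix from the list matches.
registered_domain <- function(host, suffixes) {
  labels <- strsplit(host, ".", fixed = TRUE)[[1]]
  n <- length(labels)
  # Start at the second label so at least one label remains left of the suffix;
  # moving right shortens the candidate, so the longest suffix is tried first.
  for (i in seq_len(n)[-1]) {
    candidate <- paste(labels[i:n], collapse = ".")
    if (candidate %in% suffixes) {
      return(paste(labels[(i - 1):n], collapse = "."))
    }
  }
  NA_character_
}

hosts <- c("www.google.com", "www.bbc.co.uk", "news.sina.com.cn")
domains <- vapply(hosts, registered_domain, character(1),
                  suffixes = suffixes, USE.NAMES = FALSE)
domains
# [1] "google.com"  "bbc.co.uk"   "sina.com.cn"
```

A fixed-length regex cannot make this distinction, since "co.uk" and "bbc.com" have the same shape.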

Answer 2: (score: 2)

This gets the domain name using a regex.


For example:

http://www.stackoverflow.com

Result:

stackoverflow
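
A minimal sketch of how a regex could produce that result in base R. The pattern is an assumption written for this illustration, not the code from this answer: it skips an optional scheme and an optional leading "www." and keeps the first label that follows.

```r
url <- "http://www.stackoverflow.com"

# Illustrative pattern (an assumption): optional scheme, optional "www.",
# then capture everything up to the next ".", "/" or ":".
name_pattern <- "^(?:[a-z][a-z0-9+.-]*://)?(?:www\\.)?([^./:]+).*$"

name <- sub(name_pattern, "\\1", url, perl = TRUE, ignore.case = TRUE)
name
# [1] "stackoverflow"
```

Because it keeps the first label after "www.", a host like blog.example.com would give "blog", so this only suits hosts of the form www.name.tld or name.tld.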