Question

I'm trying to capture a domain name from a long string in R. The domain name looks like this:

11.22.44.55.url.com.localhost

The regular expression I'm using is:

(gsub("(.*)\\.([^.]*url[^.]*)\\.(.*)","\\2","11.22.44.55.test.url.com.localhost",ignore.case=T)[1])

When I test it, I get the correct answer:

url.com

But when I run it on a large dataset (I'm running it with R and Hadoop), the result ends up being this:

11.22.44.55.url

And sometimes, when the domain name is

11.22.44.55.test.url.com.localhost

I get it back unchanged. I never get

url.com

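I'm not sure how this happens. The regex works fine when I test it on a single string, but it fails on my actual dataset. Am I missing a corner case that causes the problem? Some additional information about the dataset: each domain address is an element of a list, stored as a string; I unpack the list and run gsub on it.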

Answer 1

This solution uses sub twice. First, remove ".localhost" from the string. Then, extract the URL:

# example strings
test <- c("11.22.44.55.url.com.localhost", 
          "11.22.44.55.test.url.com.localhost",
          "11.22.44.55.foo.bar.localhost")


sub(".*\\.(\\w+\\.\\w+)$", "\\1", sub("\\.localhost", "", test))
# [1] "url.com" "url.com" "foo.bar"

This solution also works for strings that end in "url.com" (without ".localhost").
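If you want to reuse this on a whole list, the two steps can be wrapped in a small function. This is only a sketch: the name extract_domain is made up for illustration, and it assumes the suffix to drop is always a trailing ".localhost" (so the pattern is anchored with "$" to avoid removing ".localhost" from the middle of a string).

```r
# Hypothetical helper: strip a trailing ".localhost", then keep the last two labels.
extract_domain <- function(x) {
  # Step 1: remove the suffix only when it appears at the end of the string
  stripped <- sub("\\.localhost$", "", x)
  # Step 2: keep the last two dot-separated labels
  sub(".*\\.(\\w+\\.\\w+)$", "\\1", stripped)
}

test <- c("11.22.44.55.url.com.localhost",
          "11.22.44.55.test.url.com.localhost",
          "11.22.44.55.foo.bar.localhost")

extract_domain(test)
# [1] "url.com" "url.com" "foo.bar"
```

Since your data is stored in a list, something like extract_domain(unlist(my_list)) should work, where my_list stands in for your own object.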

Answer 2

Why not try something simpler: split on "." and then pick the parts you want?

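# split on literal dots, then paste the 6th and 7th pieces back together
x <- unlist(strsplit("11.22.44.55.test.url.com.localhost",
    split = ".", fixed = TRUE))
paste(x[6], x[7], sep = ".")
# [1] "url.com"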
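The fixed positions 6 and 7 only fit strings with exactly that many pieces. A possible variant counts from the end instead. This is a rough sketch: the function name is hypothetical, and it assumes every string has at least three dot-separated pieces and ends in a single trailing piece such as "localhost".

```r
# Hypothetical helper: take the two pieces just before the final one.
second_and_third_from_end <- function(x) {
  parts <- strsplit(x, split = ".", fixed = TRUE)  # one character vector per input string
  vapply(parts, function(p) {
    n <- length(p)
    # pieces n-2 and n-1 sit right before the trailing piece (e.g. "localhost")
    paste(p[(n - 2):(n - 1)], collapse = ".")
  }, character(1))
}

second_and_third_from_end(c("11.22.44.55.url.com.localhost",
                            "11.22.44.55.test.url.com.localhost"))
# [1] "url.com" "url.com"
```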

Answer 3

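I'm not 100% sure what you're trying to match, but this will grab "url" plus the next word/number sequence. I thought the "*" wildcard was too greedy, so I used "+", which matches 1 or more characters rather than 0 or more (like "*").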


oobar = c(
  "11.22.44.55.url.com.localhost",
  "11.22.44.55.test.url.cog.localhost",
  "11.22.44.55.test.url.com.localhost"
)

# keep "url" plus the piece that follows it
f = function(url) gsub("(.+)[\\.](url[\\.]+[^\\.]+)[\\.](.+)", "\\2", url, ignore.case = TRUE)
f(oobar)
# [1] "url.com" "url.cog" "url.com"
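One thing to keep in mind with this style: sub and gsub return the input string unchanged when the pattern does not match at all, so a failed match can look like a full domain coming back. A possible alternative is to extract the match directly with regexpr and regmatches, so that non-matching strings show up as NA. The sketch below is an assumption about what you want (the literal "url" followed by one more dot-separated piece), and the function name is made up.

```r
# Hypothetical helper: return "url.<next piece>" or NA when nothing matches.
extract_url_part <- function(x) {
  # position of the first match in each string (-1 means no match)
  m <- regexpr("url\\.[^.]+", x, ignore.case = TRUE)
  out <- rep(NA_character_, length(x))
  # regmatches() returns only the substrings for elements that matched
  out[m != -1] <- regmatches(x, m)
  out
}

extract_url_part(c("11.22.44.55.url.com.localhost",
                   "11.22.44.55.test.url.cog.localhost",
                   "11.22.44.55.foo.bar.localhost"))
# [1] "url.com" "url.cog" NA
```

Counting the NA values on the real dataset would show how many strings the pattern fails on.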

Using a regular expression to capture a specific part of a domain name in R
