Question

I know similar questions have been asked before, but none of the proposed solutions really works in every case.

So far I have built this regular expression:

(http(s)?:\/\/)?(www\.)?([a-zA-Z\-]+\.[a-z-A-Z\.]+)

It works for all of these examples (extracting google.com, or google.com.hk for the last two):

https://www.google.com/something/something
https://google.com/something/something
https://www.google.com/
https://google.com/
https://www.google.com
https://google.com
www.google.com
google.com
http://www.google.com/something/something
http://google.com/something/something
http://www.google.com/
http://google.com/
http://www.google.com
http://google.com
http://www.google.com.hk
http://google.com.hk

However, it does not work for this example (it extracts mail.google.com):

http://mail.google.com

I can't simply change the regex to (http:\/\/|https:\/\/)?([a-zA-Z]+\.)?([a-zA-Z\-]+\.[a-z-A-Z\.]+), because that would make http://google.com.hk match com.hk.
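To make the failure concrete, here is a small Perl sketch that runs both patterns from the question unchanged. The helper names `with_original` and `with_modified` are made up for this illustration; the only assumption is that the last capture group is the one meant to hold the domain.

```perl
use strict;
use warnings;

# The two patterns from the question, copied unchanged.
my $original = qr{(http(s)?:\/\/)?(www\.)?([a-zA-Z\-]+\.[a-z-A-Z\.]+)};
my $modified = qr{(http:\/\/|https:\/\/)?([a-zA-Z]+\.)?([a-zA-Z\-]+\.[a-z-A-Z\.]+)};

# Hypothetical helpers: return the last capture group, i.e. the "domain".
sub with_original { my ($url) = @_; return $url =~ $original ? $4 : undef; }
sub with_modified { my ($url) = @_; return $url =~ $modified ? $3 : undef; }

print with_original('http://mail.google.com'), "\n";   # mail.google.com
print with_modified('http://mail.google.com'), "\n";   # google.com
print with_modified('http://google.com.hk'),   "\n";   # com.hk
```

The optional `([a-zA-Z]+\.)?` group cannot tell a subdomain such as `mail.` from the registered name `google.`, so it consumes whichever label comes first.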

Any ideas? Thanks.

Answer 1

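Using the approach I outlined in my comment above, you will need to list all the suffixes you want to accept and then work your way back towards the start of the domain name: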

^(?:(?:https?://)?(?:(?:\w+\.)*?(\w+\.(com\.hk|co\.uk|com|net|org|hk)\b))).*

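Note that the list at the end must be in descending order of length, so that a longer suffix such as com.hk is tried before com!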
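A quick way to see why the order matters is to run the same structure with a shortened suffix list in both orders. This is only a sketch: the three-entry list and the names `$longest_first`, `$shortest_first` and `extract` are chosen for the illustration and are not part of the pattern above.

```perl
use strict;
use warnings;

# Same structure as the pattern above; only the order of the suffixes differs.
my $longest_first  = qr{^(?:https?://)?(?:\w+\.)*?(\w+\.(?:com\.hk|com|hk)\b)};
my $shortest_first = qr{^(?:https?://)?(?:\w+\.)*?(\w+\.(?:hk|com|com\.hk)\b)};

# Hypothetical helper: return the captured domain, or undef if nothing matched.
sub extract {
    my ($re, $url) = @_;
    return $url =~ $re ? $1 : undef;
}

print extract($longest_first,  'http://www.google.com.hk'), "\n";  # google.com.hk
print extract($shortest_first, 'http://www.google.com.hk'), "\n";  # google.com
```

With the shorter alternative first, `com` matches and `\b` is satisfied at the following dot, so the engine never gets around to trying `com\.hk`.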

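You will need to extend the list at the end, and the regexp could be made somewhat faster by eliminating backtracking, but it works with the test cases above: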

#!perl
use strict;
use warnings;

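while (my $url = <DATA>) {
    chomp $url;
    # Lazily skip leading labels (www., mail., ...), then capture "name.suffix" in $1.
    if ( $url =~ m!^(?:(?:https?://)?(?:(?:\w+\.)*?(\w+\.(com\.hk|co\.uk|com|net|org|hk)\b))).*! ) {
        print "$1\n";
    } else {
        die "Failed '$url'";
    }
}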

__DATA__
https://www.google.com/something/something
https://google.com/something/something
https://www.google.com/
https://google.com/
https://www.google.com
https://google.com
www.google.com
google.com
http://www.google.com/something/something
http://google.com/something/something
http://www.google.com/
http://google.com/
http://www.google.com
http://google.com
http://www.google.com.hk
http://google.com.hk
http://google.hk
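Since the suffix list has to grow, it may be easier to generate the alternation than to keep it ordered by hand. The following is a minimal sketch under that assumption: `@suffixes` is a deliberately incomplete example list, and `registered_domain` is a name invented here, not something from the script above.

```perl
use strict;
use warnings;

# Hypothetical, incomplete suffix list in arbitrary order; extend as needed.
my @suffixes = qw(com net org hk co.uk com.hk);

# Sort longest first so "com.hk" is tried before "com", and escape the dots.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) || $a cmp $b } @suffixes;

# Same structure as the hand-written pattern, with the generated list.
my $domain_re = qr{^(?:https?://)?(?:\w+\.)*?(\w+\.(?:$alternation))\b};

# Return "name.suffix" for a URL, or undef if no listed suffix matches.
sub registered_domain {
    my ($url) = @_;
    return $url =~ $domain_re ? $1 : undef;
}

print registered_domain('https://www.google.com/something/something'), "\n";  # google.com
print registered_domain('http://mail.google.com'), "\n";                      # google.com
print registered_domain('http://www.google.com.hk'), "\n";                    # google.com.hk
print registered_domain('http://google.hk'), "\n";                            # google.hk
```

As in the original pattern, `\w` does not match a hyphen, so a host such as my-site.com would need the character class widened.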

Regex to strip the domain name from a URL in all cases?
