Question

I would like to match domains according to the following rules:

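The domain name may contain a-z, A-Z, 0-9, and hyphens (-)
The domain name should be between 1 and 63 characters long
The last TLD must be at least two characters and at most six characters long
The domain name must not start or end with a hyphen (-) (e.g. -google.com or google-.com)
The domain name can be a subdomain (e.g. mkyong.blogspot.com)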

I already have the Java-style regex; I just need the Python-style equivalent:

^((?!-)[A-Za-z0-9-]{1,63}(?<!-)\\.)+[A-Za-z]{2,6}$

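I couldn't find any Python regex for this, because everyone recommends urlparse. I don't need to split a URL into domain, port, TLD, and so on; I only need to do a simple domain replacement, so a regex should be the right tool.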

What I have tried:

expectedstring = re.sub(r"^((?!-)[A-Za-z0-9-]{1,63}(?<!-)\\.)+[A-Za-z]{2,6}$" , "XXX" , string)

Example string:

string = "This is why this domain example.com will never be the same after some years, it might just be example.co.uk but will never get to example.-com. Documents could be located in this specific location http://en.example.com/documents/print.doc as you probably already know."

expectedstring = "This is why this domain XXX will never be the same after some years, it might just be XXX but will never get to example.-com. Documents could be located in this specific location http://XXX/documents/print.doc as you probably already know."

List of valid domains:

www.google.com
google.com
mkyong123.com
mkyong-info.com
sub.mkyong.com
sub.mkyong-info.com
mkyong.com.au
g.co
mkyong.t.t.co

List of invalid domains, and why:

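mkyong.t.t.c - TLD must be between 2 and 6 characters long
mkyong,com - comma is not allowed
mkyong - no TLD
mkyong.123 - TLD must not contain digits
.com - must start with [A-Za-z0-9]
mkyong.com/users - no TLD
-mkyong.com - cannot start with a hyphen (-)
mkyong-.com - cannot end with a hyphen (-)
sub.-mkyong.com - cannot start with a hyphen (-)
sub.mkyong-.com - cannot end with a hyphen (-)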

Answer 1

I ran a test against the given domain lists (Python 2.7.x):

import re
valid_domains = """
www.google.com
google.com
mkyong123.com
mkyong-info.com
sub.mkyong.com
sub.mkyong-info.com
mkyong.com.au
g.co
mkyong.t.t.co
"""

invalid_domains = """
mkyong.t.t.c
mkyong,com
mkyong
mkyong.123
.com
mkyong.com/users
-mkyong.com
mkyong-.com
sub.-mkyong.com
sub.mkyong-.com
"""

valid_names = valid_domains.split()
invalid_names = invalid_domains.split()

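# Each label is either a single alphanumeric character or 2-63 characters
# that start and end with an alphanumeric; at most three labels may precede the TLD.
pattern = r'^([A-Za-z0-9]\.|[A-Za-z0-9][A-Za-z0-9-]{0,61}[A-Za-z0-9]\.){1,3}[A-Za-z]{2,6}$'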

print 'checking valid domain names ============'
for name in valid_names:
    print name.ljust(50), ('True' if re.match(pattern, name) else 'False').rjust(5)

print '\nchecking invalid domain names ============'
for name in invalid_names:
    print name.ljust(50), ('True' if re.match(pattern, name) else 'False').rjust(5)

Output:

checking valid domain names ============
www.google.com                                      True
google.com                                          True
mkyong123.com                                       True
mkyong-info.com                                     True
sub.mkyong.com                                      True
sub.mkyong-info.com                                 True
mkyong.com.au                                       True
g.co                                                True
mkyong.t.t.co                                       True

checking invalid domain names ============
mkyong.t.t.c                                       False
mkyong,com                                         False
mkyong                                             False
mkyong.123                                         False
.com                                               False
mkyong.com/users                                   False
-mkyong.com                                        False
mkyong-.com                                        False
sub.-mkyong.com                                    False
sub.mkyong-.com                                    False
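
As an alternative to the rewritten pattern above, the Java expression from the question can also be ported almost verbatim. The following is a minimal sketch, assuming Python 3; the names DOMAIN_RE and is_valid_domain are made up for illustration. The only change is the escaped dot: a Java string literal needs "\\.", while a Python raw string needs just r"\.".

```python
import re

# Direct port of the Java pattern from the question. Inside a Python raw
# string the escaped dot is written \. (a Java string literal needs \\.).
DOMAIN_RE = re.compile(r"^((?!-)[A-Za-z0-9-]{1,63}(?<!-)\.)+[A-Za-z]{2,6}$")


def is_valid_domain(name):
    """Return True if name matches the ported pattern (hypothetical helper)."""
    return DOMAIN_RE.match(name) is not None


valid = ["www.google.com", "google.com", "mkyong123.com", "mkyong-info.com",
         "sub.mkyong.com", "sub.mkyong-info.com", "mkyong.com.au", "g.co",
         "mkyong.t.t.co"]
invalid = ["mkyong.t.t.c", "mkyong,com", "mkyong", "mkyong.123", ".com",
           "mkyong.com/users", "-mkyong.com", "mkyong-.com",
           "sub.-mkyong.com", "sub.mkyong-.com"]

for name in valid + invalid:
    print(name.ljust(25), is_valid_domain(name))
```

Unlike the {1,3} quantifier used above, the + here places no upper bound on the number of labels, which is closer to the "can be a subdomain" rule in the question.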

[Edit] To get the same result as the expectedstring provided in the question, I came up with the following (it does not check for "http(s)"):

import re

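# The first group captures the context before the domain: "//", whitespace,
# or start of string. Note that \w also accepts "_", so this pattern is
# looser than the validation pattern.
pattern = r'(//|\s+|^)(\w\.|\w[A-Za-z0-9-]{0,61}\w\.){1,3}[A-Za-z]{2,6}'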

string = "This is why this domain example.com will never be the same after some years, it might just be example.co.uk but will never get to example.-com. Documents could be located in this specific location http://en.example.com/documents/print.doc as you probably already know."
expectedstring = "This is why this domain XXX will never be the same after some years, it might just be XXX but will never get to example.-com. Documents could be located in this specific location http://XXX/documents/print.doc as you probably already know."

resultstring = re.sub(pattern, r"\g<1>XXX", string)

print 'resultstring: \n', resultstring
print '\nare they equal? ', expectedstring == resultstring

Output:

resultstring: 
This is why this domain XXX will never be the same after some years, it might just be XXX but will never get to example.-com. Documents could be located in this specific location http://XXX/documents/print.doc as you probably already know.

are they equal?  True
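
One possible variation on the replacement above, sketched for Python 3 and not part of the tested answer: lookbehinds can assert the context before the domain without consuming it, so nothing has to be re-inserted in the replacement. The names DOMAIN_IN_TEXT and mask_domains are hypothetical. The label rules are taken from the Java pattern in the question, and the trailing lookahead is my own addition to avoid matching only the first part of a longer token.

```python
import re

# Hypothetical variation: assert the context with lookarounds instead of
# capturing it, and allow any number of labels before the TLD.
DOMAIN_IN_TEXT = re.compile(
    r"(?:(?<=//)|(?<=\s)|^)"                 # preceded by "//", whitespace, or start of string
    r"(?:(?!-)[A-Za-z0-9-]{1,63}(?<!-)\.)+"  # labels with no leading or trailing hyphen
    r"[A-Za-z]{2,6}"                         # TLD: 2 to 6 letters
    r"(?![A-Za-z0-9-])"                      # do not stop in the middle of a longer token
)


def mask_domains(text, replacement="XXX"):
    """Replace every domain found in free text (illustrative helper)."""
    return DOMAIN_IN_TEXT.sub(replacement, text)


string = "This is why this domain example.com will never be the same after some years, it might just be example.co.uk but will never get to example.-com. Documents could be located in this specific location http://en.example.com/documents/print.doc as you probably already know."
expectedstring = "This is why this domain XXX will never be the same after some years, it might just be XXX but will never get to example.-com. Documents could be located in this specific location http://XXX/documents/print.doc as you probably already know."

print(mask_domains(string) == expectedstring)
```

The two lookbehinds are written as separate alternatives because Python's re module requires each lookbehind to have a fixed width. Requiring "//" or whitespace before the match is what keeps print.doc in the URL path from being treated as a domain.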
