Question

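I have a program, written in Python 3, that should parse several domain names every day and extract data from them.
The parsed data should serve as input for a search function and for aggregation (statistics and charts), and should save time for the analysts who use the program.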

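Just so you know: I really don't have the time to study machine learning (which seems like a good solution here), so I chose regular expressions, which I already use.
I have searched the regex documentation, inside and outside StackOverflow, and worked with the debugger on regex101, but I still haven't found a way to do what I need.
Edit (24/6/2019): I mention machine learning because I need a complex parser that automates things as much as possible. It would be useful for making automatic choices (blacklists, whitelists, and so on).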

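The parser should take the following into account:

a maximum of 126 subdomains plus the TLD
each subdomain must not be longer than 64 characters
each subdomain can contain only alphanumeric characters and the - character
each subdomain must not begin or end with the - character
the TLD must not be longer than 64 characters
the TLD must not contain only digits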

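But I want to go a little deeper:

the first string can (optionally) contain a "usage type" such as cpanel., mail., webdisk., autodiscover. and so on (or maybe a simple www.)
the TLD can (optionally) contain a particle such as .co, .gov, .edu and so on (.co.uk, for example)
the final part of the TLD is not currently checked against any list of ccTLDs/gTLDs, and I don't think it will be in the future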

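What I thought would help solve the problem is one regex group for the optional usage type, one for each subdomain, and one for the TLD (the optional particle must be inside the TLD group).
With these rules in mind, I came up with a solution: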

^(?P<USAGE>autodiscover|correo|cpanel|ftp|mail|new|server|webdisk|webhost|webmail[\d]?|wiki|www[\d]?\.)?([a-z\d][a-z\d\-]{0,62}[a-z\d])?((\.[a-z\d][a-z\d\-]{0,62}[a-z\d]){0,124}?)(?P<TLD>(\.co|\.com|\.edu|\.net|\.org|\.gov)?\.(?!\d+)[a-z\d]{1,64})$

The solution above does not return the expected results.

Here are a couple of examples:

A couple of strings to parse

without.further.ado.lets.travel.the.forest.com  
www.without.further.ado.lets.travel.the.forest.gov.it

The groups I expect to find

full match without.further.ado.lets.travel.the.forest.com
group2 without
group3 further
group4 ado
group5 lets
group6 travel
group7 the
group8 forest
groupTLD .com
full match www.without.further.ado.lets.travel.the.forest.gov.it
groupUSAGE www.
group2 without
group3 further
group4 ado
group5 lets
group6 travel
group7 the
group8 forest
groupTLD .gov.it

The groups I actually find

full match without.further.ado.lets.travel.the.forest.com
group2 without
group3 .further.ado.lets.travel.the.forest
group4 .forest
groupTLD .com
full match www.without.further.ado.lets.travel.the.forest.gov.it
groupUSAGE www.
group2 without
group3 .further.ado.lets.travel.the.forest
group4 .forest
groupTLD .gov.it
group6 .gov

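As you can see from the examples, a couple of particles are found twice, and that is not the behavior I am after anyway. Every attempt to edit the pattern produces unexpected output.
Is there a way to get the expected results?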

Answer 1

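This is a simple, well-defined task. There is no fuzziness, no complexity, no guesswork, just a series of easy tests that settle everything on your checklist. I don't see how "machine learning" would be appropriate or helpful. Even regex is completely unnecessary.

I haven't implemented everything you want to verify, but it wouldn't be hard to fill in the missing pieces.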

import string

double_tld = ['gov', 'edu', 'co', 'add_others_you_need']

# we'll use this instead of regex to check subdomain validity
valid_sd_characters = string.ascii_letters + string.digits + '-'
valid_trans = str.maketrans('', '', valid_sd_characters)

def is_invalid_sd(sd):
    return sd.translate(valid_trans) != ''

def check_hostname(hostname):
    subdomains = hostname.split('.')

    # each subdomain can contain only alphanumeric characters and
    # the - character
    invalid_parts = list(filter(is_invalid_sd, subdomains))
    # TODO react if there are any invalid parts

    # "the TLD can (optionally) contain a particle like
    # .co, .gov, .edu and so on (.co.uk for example)"
    if len(subdomains) >= 2 and subdomains[-2] in double_tld:
        subdomains[-2] += '.' + subdomains[-1]
        subdomains = subdomains[:-1]

    # "a maximum number of 126 subdomains plus the TLD"
    # TODO check list length of subdomains

    # "each subdomain must not begin or end with the - character"
    # "the TLD must not be longer than 64 characters"
    # "the TLD must not contain only digits"
    # TODO write loop, check first and last characters, length, isnumeric

    # TODO return something
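
The TODOs above could be filled in along the following lines. This is only a minimal sketch under stated assumptions: the name parse_hostname, the constants, the extra particles in DOUBLE_TLD, and the choice to raise ValueError are all illustrative and not part of the code above; the limits (126 subdomains, 64 characters) are taken from the question.

```python
import string

# Assumed list of second-level "particles"; extend as needed.
DOUBLE_TLD = {'gov', 'edu', 'co', 'com', 'net', 'org'}
VALID_SD_CHARACTERS = set(string.ascii_letters + string.digits + '-')
MAX_SUBDOMAINS = 126    # limit taken from the question
MAX_LABEL_LENGTH = 64   # limit taken from the question


def parse_hostname(hostname):
    """Return (subdomains, tld) or raise ValueError if a rule is broken."""
    labels = hostname.lower().split('.')
    if len(labels) < 2:
        raise ValueError('expected at least one subdomain and a TLD')

    # Per-label checks: length, allowed characters, no leading/trailing "-".
    for label in labels:
        if not label or len(label) > MAX_LABEL_LENGTH:
            raise ValueError(f'bad label length: {label!r}')
        if not set(label) <= VALID_SD_CHARACTERS:
            raise ValueError(f'invalid character in {label!r}')
        if label.startswith('-') or label.endswith('-'):
            raise ValueError(f'label starts or ends with "-": {label!r}')

    # The final part of the TLD must not consist of digits only.
    if labels[-1].isdigit():
        raise ValueError('the TLD must not contain only digits')

    # Fold an optional particle such as "gov" or "co" into the TLD.
    if len(labels) >= 3 and labels[-2] in DOUBLE_TLD:
        subdomains, tld = labels[:-2], '.'.join(labels[-2:])
    else:
        subdomains, tld = labels[:-1], labels[-1]

    if len(subdomains) > MAX_SUBDOMAINS:
        raise ValueError('too many subdomains')
    return subdomains, tld
```

In this sketch a leading www is returned as an ordinary subdomain; separating out a "usage type" would be one more membership test on the first label.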

Answer 2

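I don't know whether it is possible to get the output exactly as you asked. I don't think a single pattern can capture the results into different groups (group2, group3, ...).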
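
To illustrate the point with the standard re module: a capturing group inside a repetition is overwritten on every pass, so only its last match is kept, and the number of groups is fixed by the pattern rather than by the input. The pattern below is a simplified stand-in chosen for the demonstration, not the one from the question.

```python
import re

# Simplified stand-in pattern: repeated "label." parts followed by a TLD.
pattern = re.compile(r'^(?:([a-z\d-]+)\.)+([a-z]+)$')
match = pattern.match('without.further.ado.com')

# Group 1 sits inside the repetition, so each pass overwrites it
# and only the last label survives.
last_label = match.group(1)    # 'ado'
tld = match.group(2)           # 'com'

# The group count depends on the pattern, not on the number of labels.
group_count = pattern.groups   # 2
```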

I found a way to get almost the expected result using the regex module.

import regex

match = regex.search(r'^(?:(?P<USAGE>autodiscover|correo|cpanel|ftp|mail|new|server|webdisk|webhost|webmail[\d]?|wiki|www[\d]?)\.)?(?:([a-z\d][a-z\d\-]{0,62}[a-z\d])\.){0,124}?(?P<TLD>(?:co|com|edu|net|org|gov)?\.(?!\d+)[a-z\d]{1,64})$', 'www.without.further.ado.lets.travel.the.forest.gov.it')

Output:

match.captures(0)
['www.without.further.ado.lets.travel.the.forest.gov.it']
match.captures(1) or match.captures('USAGE')
['www']
match.captures(2)
['without', 'further', 'ado', 'lets', 'travel', 'the', 'forest']
match.captures(3) or match.captures('TLD')
['gov.it']

Here, to keep the . out of the captured group, I wrapped it in a non-capturing group like this:

(?:([a-z\d][a-z\d\-]{0,62}[a-z\d])\.)
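
If installing the third-party regex module is not an option, a similar result can probably be had with the standard re module by capturing the whole run of subdomains in one group and splitting it afterwards. The following is a sketch under assumptions, not a drop-in replacement: split_hostname, HOST_RE and the SUBS group are made-up names, single-character labels are allowed, and the dot is moved inside the optional particle so that a plain TLD such as com also matches.

```python
import re

# Usage types and particles are copied from the pattern in the question.
HOST_RE = re.compile(
    r'^(?:(?P<USAGE>autodiscover|correo|cpanel|ftp|mail|new|server|webdisk'
    r'|webhost|webmail\d?|wiki|www\d?)\.)?'
    # All subdomains in one group, each followed by its dot.
    r'(?P<SUBS>(?:[a-z\d](?:[a-z\d-]{0,62}[a-z\d])?\.){0,125}?)'
    # Optional particle plus a final part that is not digits only.
    r'(?P<TLD>(?:(?:co|com|edu|net|org|gov)\.)?(?!\d+$)[a-z\d]{1,64})$'
)


def split_hostname(hostname):
    """Return a dict with usage, subdomains and tld, or None if no match."""
    match = HOST_RE.match(hostname.lower())
    if match is None:
        return None
    subs = match.group('SUBS').rstrip('.')
    return {
        'usage': match.group('USAGE'),
        'subdomains': subs.split('.') if subs else [],
        'tld': match.group('TLD'),
    }
```

For example, split_hostname('www.without.further.ado.lets.travel.the.forest.gov.it') gives usage 'www', the seven subdomains from 'without' to 'forest' as a list, and tld 'gov.it'.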

Hope that helps.

Advanced grouping in a domain-name regex with Python 3