Question

I have a list of domains, for example

site.co.uk
site.com
site.me.uk
site.jpn.com
site.org.uk
site.it

The domains can also include 3rd- and 4th-level domains, for example

test.example.site.org.uk
test2.site.com

I need to extract the second-level domain, which in all of these cases is site

Any ideas? :)

Answer 1

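There is no way to get it reliably from the string alone. Subdomains are arbitrary, and there is a monster list of domain extensions that grows every day. The best you can do is check against that monster list of domain extensions and keep your copy of it up to date.

The list: http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1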
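
To illustrate why a fixed position in the name is not enough, here is a minimal sketch of the naive approach, which just takes the label before the last dot. The helper name naive_second_level is made up for this illustration.

```python
def naive_second_level(host):
    """Return the label just before the last dot (hypothetical helper)."""
    labels = host.split('.')
    return labels[-2] if len(labels) >= 2 else None

# Works for single-label suffixes, fails as soon as the suffix has two labels.
for host in ['site.com', 'site.it', 'site.co.uk', 'site.jpn.com']:
    print(host, '->', naive_second_level(host))
```

It returns site for site.com and site.it, but co for site.co.uk and jpn for site.jpn.com, which is why a suffix list is needed.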

Answer 2

Following @kohlehydrat's suggestion:

from urllib.request import urlopen

class TldMatcher(object):
    # use class vars for lazy loading
    MASTERURL = "http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1"
    TLDS = None

    @classmethod
    def loadTlds(cls, url=None):
        url = url or cls.MASTERURL

        # grab master list; urlopen returns bytes, so decode before splitting
        lines = urlopen(url).read().decode('utf-8').splitlines()

        # strip comments and blank lines
        lines = [ln for ln in (ln.strip() for ln in lines) if len(ln) and ln[:2]!='//']

        cls.TLDS = set(lines)

    def __init__(self):
        if TldMatcher.TLDS is None:
            TldMatcher.loadTlds()

    def getTld(self, url):
        best_match = None
        chunks = url.split('.')

        # walk from the shortest suffix to the longest, keeping the longest match
        for start in range(len(chunks)-1, -1, -1):
            test = '.'.join(chunks[start:])
            startest = '.'.join(['*']+chunks[start+1:])

            if test in TldMatcher.TLDS or startest in TldMatcher.TLDS:
                best_match = test

        return best_match

    def get2ld(self, url):
        urls = url.split('.')
        tld = self.getTld(url)
        # no known suffix matched, or nothing is left in front of the suffix
        if tld is None or len(urls) <= len(tld.split('.')):
            return None
        tlds = tld.split('.')
        return urls[-1 - len(tlds)]


def test_TldMatcher():
    matcher = TldMatcher()

    test_urls = [
        'site.co.uk',
        'site.com',
        'site.me.uk',
        'site.jpn.com',
        'site.org.uk',
        'site.it'
    ]

    errors = 0
    for u in test_urls:
        res = matcher.get2ld(u)
        if res != 'site':
            print("Error: found '{0}', should be 'site'".format(res))
            errors += 1

    if errors==0:
        print("Passed!")
    return (errors==0)

Answer 3

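The problem is that 1st-level and 2nd-level suffixes are mixed.

A trivial solution...

Build a list of possible site suffixes, ordered from narrow to wide: "co.uk", "uk", "co.jp", "jp", "com"

Then check whether a suffix matches the end of the domain. If it matches, the next part to the left is the site.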
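
The idea can be sketched roughly as follows. The SUFFIXES list is a hand-written assumption covering only the examples from the question, and site_label is a made-up name; a real list would be far longer.

```python
# Hypothetical, hand-written suffix list; a real one would be far longer.
SUFFIXES = ['co.uk', 'me.uk', 'org.uk', 'jpn.com', 'uk', 'com', 'it']

def site_label(host, suffixes=SUFFIXES):
    """Return the label left of the narrowest matching suffix, or None."""
    # Try suffixes with more labels first so 'co.uk' is tested before 'uk'.
    for suffix in sorted(suffixes, key=lambda s: s.count('.'), reverse=True):
        if host.endswith('.' + suffix):
            remainder = host[:-(len(suffix) + 1)]
            return remainder.split('.')[-1]
    return None

for host in ['site.co.uk', 'test.example.site.org.uk', 'test2.site.com']:
    print(host, '->', site_label(host))
```

Sorting by label count is what keeps "uk" from matching before "co.uk"; any name whose suffix is missing from the list simply yields None.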

Answer 4

Use the Python tld package

https://pypi.python.org/pypi/tld

$ pip install tld

from tld import get_tld, get_fld

print(get_tld("http://www.google.co.uk"))
'co.uk'

print(get_fld("http://www.google.co.uk"))
'google.co.uk'
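
The question asks for the bare label (google here) rather than the whole registered domain. Given the two strings printed above, it can be recovered with plain string handling, as in this small sketch. The helper label_from_fld is hypothetical and not part of the tld package; the package may offer this more directly (its documentation describes an as_object option), so check there first.

```python
def label_from_fld(fld, tld):
    """Strip the '.<tld>' ending from a registered domain (hypothetical helper)."""
    ending = '.' + tld
    if not fld.endswith(ending):
        raise ValueError('%r does not end with %r' % (fld, ending))
    return fld[:-len(ending)]

# Values taken from the get_fld / get_tld output shown above.
print(label_from_fld('google.co.uk', 'co.uk'))
```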

Answer 5

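The only feasible way is to go through a list containing all top-level domains (e.g. .com or co.uk). You would then walk through this list and check each entry against the domain. I don't see any other way, at least not without accessing the internet at runtime.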
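
For the offline case, a saved copy of such a list can be read once at startup. The sketch below assumes the format used by the Mozilla file mentioned in this thread (one rule per line, comments starting with //); the function name and the file name in the comment are assumptions.

```python
import io

def parse_suffix_lines(lines):
    """Collect rules from an iterable of lines, skipping '//' comments and blanks."""
    rules = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith('//'):
            rules.add(line)
    return rules

# Stand-in for a file on disk. With a real saved copy one would pass
# open('effective_tld_names.dat', encoding='utf-8') instead (assumed path).
sample = io.StringIO("// comment line\ncom\n\nuk\nco.uk\n")
print(sorted(parse_suffix_lines(sample)))
```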

Answer 6

@Hugh Bothwell

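Your example does not handle special domains such as parliament.uk, which are marked in the file with "!" (e.g. !parliament.uk).

I made some changes to your code, and also made it look more like the PHP function I used before.

I also added the possibility of loading the data from a local file.

I also tested it with some domains: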

niki.bg, niki.1.bg
parliament.uk
niki.at, niki.co.at
niki.us, niki.ny.us
niki.museum, niki.national.museum
www.niki.uk - reported as OK because of the "*" in the Mozilla file.
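
For illustration only, here is a minimal sketch of how "!" and "*" rules can be handled. It is not the code from the repository linked below; SAMPLE_RULES is a hand-picked sample in the file's format rather than the real list, and registered_label is a made-up name. It assumes that an exception rule wins over any other match and that "*" stands for exactly one label.

```python
# Hand-picked sample in the list's format, not the real file.
SAMPLE_RULES = ['bg', '1.bg', '*.uk', '!parliament.uk', 'us', 'ny.us']

def registered_label(host, rules=SAMPLE_RULES):
    """Return the label left of the public suffix, or None if host is only a suffix."""
    labels = host.lower().split('.')
    exceptions = {r[1:] for r in rules if r.startswith('!')}
    plain = {r for r in rules if not r.startswith('!')}

    suffix_len = 1  # fallback when no rule matches: the last label is the suffix
    for start in range(len(labels)):
        if '.'.join(labels[start:]) in exceptions:
            # An exception rule wins; its suffix is the rule minus the leftmost label.
            suffix_len = len(labels) - start - 1
            break
    else:
        # No exception matched: take the longest plain or wildcard rule.
        for start in range(len(labels)):
            tail = labels[start:]
            if '.'.join(tail) in plain or '.'.join(['*'] + tail[1:]) in plain:
                suffix_len = len(tail)
                break

    if len(labels) <= suffix_len:
        return None
    return labels[-suffix_len - 1]

for host in ['niki.bg', 'niki.1.bg', 'parliament.uk', 'niki.ny.us', 'www.niki.uk']:
    print(host, '->', registered_label(host))
```

With this sample, the "*.uk" rule makes niki.uk itself a suffix, so www is the label returned for www.niki.uk, while the "!parliament.uk" exception keeps parliament as the label for parliament.uk.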

Feel free to contact me @ github so that I can add you as a co-author.

The GitHub repo is here:

https://github.com/nmmmnu/TLDExtractor/blob/master/TLDExtractor.py

Extract the second-level domain from a domain? - Python
