Question

How do I extract the domain name from a URL, excluding any subdomains?

My initial simplistic attempt was:

'.'.join(urlparse.urlparse(url).netloc.split('.')[-2:])

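This works for http://www.foo.com, but not for http://www.foo.com.au. Is there a way to do this properly without relying on special knowledge of valid TLDs (top-level domains) or country codes (because they change)?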
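For reference, here is a Python 3 rendering of the attempt above, as a sketch only. The name `naive_domain` is made up for illustration, and `urllib.parse` stands in for the Python 2 `urlparse` module. It shows exactly where the two-label rule breaks:

```python
from urllib.parse import urlparse

def naive_domain(url):
    """Keep only the last two dot-separated labels of the host."""
    return ".".join(urlparse(url).netloc.split(".")[-2:])

print(naive_domain("http://www.foo.com"))     # foo.com
print(naive_domain("http://www.foo.com.au"))  # com.au, not foo.com.au
```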

Thanks

Answer 1

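Here's a great Python module that someone wrote to solve this problem after seeing this question: https://github.com/john-kurkowski/tldextract

The module looks up TLDs in the Public Suffix List, which is maintained by Mozilla volunteers.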

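Quote:

tldextract on the other hand knows what all gTLDs [generic top-level domains] and ccTLDs [country code top-level domains] look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.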

Answer 2

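No, there is no "intrinsic" way of knowing that (for example) zap.co.it is a subdomain (because Italy's registrar does sell domains such as co.it) while zap.co.uk isn't (because the UK's registrar does not sell domains such as co.uk, only ones like zap.co.uk).

You'll just have to use an auxiliary table (or online source) to tell you which TLDs behave peculiarly, like the UK's and Australia's. There is no way to work that out by just staring at the string without such extra semantic knowledge. (Of course the table can change eventually, but if you can find a good online source, that source will also change accordingly, one hopes!)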
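A minimal sketch of the auxiliary-table idea, assuming a tiny hand-picked table. `SAMPLE_SUFFIXES` and `registered_domain` are hypothetical names, and a real table would be much larger and would need refreshing as suffixes change:

```python
from urllib.parse import urlparse

# Hypothetical, hand-picked sample of suffixes under which names are
# registered; a real table is far larger and changes over time.
SAMPLE_SUFFIXES = {"com", "uk", "co.uk", "au", "com.au"}

def registered_domain(url, suffixes=SAMPLE_SUFFIXES):
    """Return the longest known suffix plus one more label to its left."""
    labels = urlparse(url).hostname.split(".")
    # Try the longest candidate suffix first so "co.uk" beats "uk".
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return ".".join(labels[i - 1:])
    raise ValueError("no known suffix in %r" % url)

print(registered_domain("http://www.foo.com"))     # foo.com
print(registered_domain("http://www.foo.com.au"))  # foo.com.au
```

The string handling is trivial; all of the knowledge lives in the table, which is the point made above.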

Answer 3

Using this file of effective TLDs, which someone else found on Mozilla's website:

from urllib.parse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt", encoding="utf-8") as tld_file:
    tlds = {line.strip() for line in tld_file if line[0] not in "/\n"}

def get_domain(url, tlds):
    url_elements = urlparse(url).netloc.split('.')
    # url_elements = ["abcde","co","uk"]

    for i in range(-len(url_elements), 0):
        last_i_elements = url_elements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc

        candidate = ".".join(last_i_elements) # abcde.co.uk, co.uk, uk
        wildcard_candidate = ".".join(["*"] + last_i_elements[1:]) # *.co.uk, *.uk, *
        exception_candidate = "!" + candidate

        # match tlds:
        if exception_candidate in tlds:
            return ".".join(url_elements[i:])
        if candidate in tlds or wildcard_candidate in tlds:
            return ".".join(url_elements[i-1:])
            # returns "abcde.co.uk"

    raise ValueError("Domain not in global list of TLDs")

print(get_domain("http://abcde.co.uk", tlds))

Result:

abcde.co.uk

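I'd appreciate it if someone could let me know which parts of the above could be rewritten in a more Pythonic way. For example, there must be a better way of iterating over the last_i_elements list, but I couldn't think of one. I also don't know whether ValueError is the best thing to raise. Comments?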
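One possible tidier shape for the same lookup is sketched here with a small inline rule set instead of the downloaded file, so that the wildcard (`*.`) and exception (`!`) rules can be exercised directly. `SAMPLE_RULES` is illustrative only, and raising when the host is itself a suffix is a design choice of this sketch, not something the original code does:

```python
from urllib.parse import urlparse

# Tiny inline stand-in for the rules file; entries are illustrative only.
SAMPLE_RULES = {"com", "uk", "co.uk", "*.ck", "!www.ck"}

def get_domain(url, rules=SAMPLE_RULES):
    """Return the registrable domain of url according to rules."""
    labels = urlparse(url).hostname.split(".")
    # Walk from the longest candidate (the whole host) to the shortest.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        wildcard = ".".join(["*"] + labels[i + 1:])
        if "!" + candidate in rules:
            # Exception rule: the candidate itself is registrable.
            return candidate
        if candidate in rules or wildcard in rules:
            if i == 0:
                break  # the host is itself a suffix, nothing to return
            return ".".join(labels[i - 1:])
    raise ValueError("no registrable domain found in %r" % url)

print(get_domain("http://www.abcde.co.uk"))  # abcde.co.uk
print(get_domain("http://www.ck"))           # www.ck (exception rule)
print(get_domain("http://shop.foo.ck"))      # shop.foo.ck (wildcard rule)
```

Iterating over start indexes with `range(len(labels))` avoids the negative-index bookkeeping. `ValueError` still seems a reasonable choice, since the argument has the right type but an unusable value.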

Answer 4

Use the Python tld package

https://pypi.python.org/pypi/tld

Install

pip install tld

Get the TLD name as a string from the given URL

from tld import get_tld
print(get_tld("http://www.google.co.uk"))

co.uk

or without the protocol

from tld import get_tld

get_tld("www.google.co.uk", fix_protocol=True)

co.uk

Get the TLD as an object

from tld import get_tld

res = get_tld("http://some.subdomain.google.co.uk", as_object=True)

res
# 'co.uk'

res.subdomain
# 'some.subdomain'

res.domain
# 'google'

res.tld
# 'co.uk'

res.fld
# 'google.co.uk'

res.parsed_url
# SplitResult(
#     scheme='http',
#     netloc='some.subdomain.google.co.uk',
#     path='',
#     query='',
#     fragment=''
# )

Get the first-level domain name as a string from the given URL

from tld import get_fld

get_fld("http://www.google.co.uk")
# 'google.co.uk'

Answer 5

There are many, many TLDs. Here's the list:

http://data.iana.org/TLD/tlds-alpha-by-domain.txt

Here's another list:

http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains

Here's another list:

http://www.iana.org/domains/root/db/
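As a rough sketch of how the first list could be used, the snippet below parses text in the same shape as the IANA file (a `#` comment header, then one upper-case TLD per line). The inline excerpt and the names `parse_tld_list` and `has_known_tld` are made up for illustration. A real script would read the downloaded file instead.

```python
# Inline excerpt in the same shape as the IANA file; the header text and
# the selection of entries here are made up for illustration.
SAMPLE_IANA_TEXT = """\
# Version 0000000000, sample excerpt for illustration
AU
COM
ORG
UK
"""

def parse_tld_list(text):
    """Return the TLDs in text as a set of lower-case strings."""
    return {
        line.strip().lower()
        for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    }

def has_known_tld(hostname, tlds):
    """True if the last label of hostname is in the TLD set."""
    return hostname.rsplit(".", 1)[-1].lower() in tlds

tlds = parse_tld_list(SAMPLE_IANA_TEXT)
print(has_known_tld("www.foo.com.au", tlds))  # True
print("com.au" in tlds)                       # False
```

A list of this kind holds only the top-level labels, so it can validate the final label but cannot by itself say that com.au is a suffix people register names under.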

Answer 6

Here's how I handle it:

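import re
import sys
import urlparse  # Python 2; on Python 3 use: from urllib import parse as urlparse

# url is assumed to hold the input string
if not url.startswith('http'):
    url = 'http://' + url
website = urlparse.urlparse(url)[1]
domain = '.'.join(website.split('.')[-2:])
match = re.search(r'((www\.)?([A-Z0-9.-]+\.[A-Z]{2,4}))', domain, re.I)
# exit with an error status if the host does not look like a domain
if not match or not match.group(0):
    sys.exit(2)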

Answer 7

Until get_tld is updated for all the new TLDs, I pull the TLD out of the error message. Sure, it's bad code, but it works.

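# Method on a class that has a content_url attribute; the inner call
# resolves to the module-level get_tld imported from the tld package.
def get_tld(self):
  try:
    return get_tld(self.content_url)
  except Exception as e:
    re_domain = re.compile("Domain ([^ ]+) didn't match any existing TLD name!")
    matchObj = re_domain.findall(str(e))
    if matchObj:
      return matchObj[0]
    raise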

Answer 8

In Python, I used to use tldextract until it failed on a URL like www.mybrand.sa.com, parsing it as subdomain='order.mybrand', domain='sa', suffix='com'!

So in the end I decided to write this method, which only works for URLs that contain a subdomain:

def urlextract(url):
  url_split=url.split(".")
  if len(url_split) <= 2:
      raise Exception("Full url required with subdomain:",url)
  return {'subdomain': url_split[0], 'domain': url_split[1], 'suffix': ".".join(url_split[2:])}
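A usage sketch for the function above, restated so that the snippet runs on its own. The wrapper `urlextract_from_url` is a hypothetical addition that strips the scheme, port and path with `urlsplit` first, because splitting a full URL on "." would otherwise leave "http://www" as the subdomain. `ValueError` is swapped in for the bare `Exception` here.

```python
from urllib.parse import urlsplit

def urlextract(hostname):
    """Split a host name, assuming exactly one subdomain label."""
    parts = hostname.split(".")
    if len(parts) <= 2:
        raise ValueError("Full host name required with subdomain: %r" % hostname)
    return {"subdomain": parts[0], "domain": parts[1],
            "suffix": ".".join(parts[2:])}

def urlextract_from_url(url):
    """Hypothetical wrapper: reduce a full URL to its host name first."""
    return urlextract(urlsplit(url).hostname)

print(urlextract_from_url("http://www.mybrand.sa.com/path"))
# {'subdomain': 'www', 'domain': 'mybrand', 'suffix': 'sa.com'}
```

The approach depends on the host having exactly one subdomain label. For a host such as a.b.foo.com it would report 'b' as the domain and 'foo.com' as the suffix.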

How to extract the top-level domain (TLD) from a URL
