Question

This is an extension of "Get protocol + host name from URL", with the added requirement that I only want the domain name, not the subdomain.

For example,

Input: classes.usc.edu/xxx/yy/zz
Output: usc.edu

Input: mail.google.com
Output: google.com

Input: google.co.uk
Output: google.co.uk

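For more context: I accept one or more seed URLs from the user and then run a scraping crawler on the links. I need the domain name (without the subdomain) to set the allowed_urls attribute.

I have also looked at "Python urlparse -- extract domain name without subdomain", but the answers there seem outdated.

My current code uses urlparse, but that also returns the subdomain, which I don't want...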

from urllib.parse import urlparse

uri = urlparse('https://classes.usc.edu/term-20191/classes/csci/')
f'{uri.scheme}://{uri.netloc}/'
# 'https://classes.usc.edu/'

Is there a (hopefully stdlib) way to get (only) the domain in python-3.x?

Answer 1

I use tldextract whenever I need to parse domains.

In your case, you just need to combine domain + suffix:

import tldextract

tldextract.extract('mail.google.com')
# ExtractResult(subdomain='mail', domain='google', suffix='com')
tldextract.extract('classes.usc.edu/xxx/yy/zz')
# ExtractResult(subdomain='classes', domain='usc', suffix='edu')
tldextract.extract('google.co.uk')
# ExtractResult(subdomain='', domain='google', suffix='co.uk')
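
To make the "combine domain + suffix" step concrete, here is a minimal sketch. It assumes only that the result object exposes the domain and suffix fields shown in the output above. The Parts namedtuple, the registered_domain and base_url helper names, and the "https" fallback scheme are all hypothetical, introduced for illustration so the snippet runs without tldextract installed.

```python
from collections import namedtuple
from urllib.parse import urlparse

# Stand-in with the same field names as the ExtractResult shown above
# (an assumption for illustration, so the sketch runs without tldextract).
Parts = namedtuple("Parts", ["subdomain", "domain", "suffix"])


def registered_domain(parts):
    """Join domain and suffix, dropping the subdomain."""
    # Skip empty pieces so a host with no suffix does not gain a trailing dot.
    return ".".join(p for p in (parts.domain, parts.suffix) if p)


def base_url(url, parts):
    """Rebuild scheme + domain (without subdomain) from a URL and its parts."""
    # Falling back to "https" when the URL has no scheme is an assumption.
    scheme = urlparse(url).scheme or "https"
    return f"{scheme}://{registered_domain(parts)}/"


print(registered_domain(Parts("mail", "google", "com")))
print(registered_domain(Parts("", "google", "co.uk")))
print(base_url("https://classes.usc.edu/term-20191/classes/csci/",
               Parts("classes", "usc", "edu")))
```

With the library installed, passing tldextract.extract(url) in place of the Parts stand-in should behave the same way, since only the domain and suffix attributes are read.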
