Question

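The post "Get domain name from URL" suggests several libraries for getting the top-level domain. But

How else can I strip the domain name from a web page URL without any additional library?

I tried a regex and it seems to work, but I'm sure there are better ways to do this, and there are many URLs that would break the regex: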

>>> import re
>>> url = "http://stackoverflow.com/questions/22143342/how-else-can-i-strip-a-domain-name-from-webpage-with-no-additional-library-pyt"
>>> domain = re.sub("(http://|http://www\\.|www\\.)","",url).split('/')[0]
>>> domain
'stackoverflow.com'
>>> url = "www.apple.com/itune"
>>> re.sub("(http://|http://www\\.|www\\.)","",url).split('/')[0]
'apple.com'

I also tried urlparse, but ended up with None:

>>> from urlparse import urlparse  # Python 2; in Python 3: from urllib.parse import urlparse
>>> url ='https://stackoverflow.com/questions/22143342/how-else-can-i-strip-a-domain-name-from-webpage-with-no-additional-library-pyt'
>>> urlparse(url).hostname
'stackoverflow.com'
>>> url = 'www.apple.com/itune'
>>> urlparse(url).hostname
>>>

Answer 1

How about creating a function that wraps urlparse?

>>> from urlparse import urlparse  # Python 2; in Python 3: from urllib.parse import urlparse
>>>
>>> def extract_hostname(url):
...     components = urlparse(url)
...     if not components.scheme:
...         components = urlparse('http://' + url)
...     return components.netloc
...
>>> extract_hostname('http://stackoverflow.com/questions/22143342')
'stackoverflow.com'
>>> extract_hostname('www.apple.com/itune')
'www.apple.com'
>>> extract_hostname('file:///usr/bin/python')
''
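
For what it's worth, plain urlparse returns None for 'www.apple.com/itune' because, without a scheme and '//', the whole string is parsed as a path and the netloc stays empty. Below is a minimal Python 3 sketch of the same wrapper idea, assuming you also want the leading 'www.' dropped as the regex in the question does. The names extract_domain and strip_www are my own, not part of any library. It reads .hostname instead of .netloc so that a port or credentials are left out.

```python
from urllib.parse import urlparse


def extract_domain(url, strip_www=True):
    """Return the host of url, tolerating a missing scheme.

    extract_domain and strip_www are illustrative names (assumptions),
    not taken from the answer above or from the standard library.
    """
    parts = urlparse(url)
    if not parts.scheme:
        # No scheme: urlparse would treat the whole string as a path,
        # so prepend one and parse again (same trick as the answer).
        parts = urlparse('http://' + url)
    # .hostname is None when there is no network location (e.g. file:///...)
    host = parts.hostname or ''
    if strip_www and host.startswith('www.'):
        host = host[len('www.'):]
    return host


print(extract_domain('http://stackoverflow.com/questions/22143342'))  # stackoverflow.com
print(extract_domain('www.apple.com/itune'))                          # apple.com
print(extract_domain('file:///usr/bin/python'))                       # (empty string)
```

This is still not a real "registered domain" extractor: something like 'news.bbc.co.uk' comes back unchanged, which is the problem the libraries in the linked post try to solve.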

Answer 2

Use urllib.parse from the standard library (Python 3).

>>> from urllib.parse import urlparse
>>> url = 'http://stackoverflow.com/questions/22143342/how-else-can-i-strip-a-domain-name-from-webpage-with-no-additional-library-pyt'
>>> urlparse(url).hostname
'stackoverflow.com'
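
One thing that may be worth knowing when choosing between .hostname and .netloc: they are not the same once the URL carries a port or credentials. A small sketch, using a made-up URL purely for illustration:

```python
from urllib.parse import urlparse

# Hypothetical URL with credentials, mixed case and a port.
parts = urlparse('http://user:secret@Example.COM:8080/path')

print(parts.netloc)    # user:secret@Example.COM:8080
print(parts.hostname)  # example.com  (lowercased, no port, no credentials)
print(parts.port)      # 8080
```

So if you only want the host, .hostname is probably the safer attribute.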
