Question

I have the following input string:

/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278

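I want to use urlparse() to get the domain, but in this case the netloc attribute is an empty string.

How can I extract the domain name (ideally without the www)?

Desired output: some-super-domain.de

Please note: sometimes the input string above contains no www!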

Answer 1

I don't think urlparse alone gives you what you want, but a regex can:

import re
m = re.search(r'(?<=www\.)[a-zA-Z\-]+\.[a-zA-Z]+', s)  # s is the input string from the question
print(m.group(0))

Result:

some-super-domain.de
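
The lookbehind above only matches when the host starts with www., which the question says is not always the case. A minimal sketch of a variant with an optional www. follows; the pattern and the name find_domain are assumptions for illustration, not part of the original answer, and the pattern only expects the URL shapes shown in the question.

```python
import re

# Hypothetical pattern: accepts ':' or its percent-encoded form '%3A'
# after the scheme, and treats a leading 'www.' as optional.
DOMAIN_RE = re.compile(r'https?(?::|%3A)//(?:www\.)?([^/?&#%]+)', re.IGNORECASE)

def find_domain(text):
    """Return the host of the first embedded URL without 'www.', or None."""
    match = DOMAIN_RE.search(text)
    return match.group(1) if match else None

s = ('/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/'
     'viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278')

print(find_domain(s))                     # with www.
print(find_domain(s.replace('www.', '')))  # same input without www.
```

Both calls should print some-super-domain.de. Returning None instead of raising keeps the helper usable on strings that contain no embedded URL.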


So, if you use urlparse, the result is:

s='/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278'

from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
o = urlparse(s)
print(o)

Result:

ParseResult(scheme='', netloc='', path='/cgi-bin/ivw/CP/dbb_ug_sp', params='', query='r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278', fragment='')

So in this result the domain is reachable through o.query, but that is not what you want, because it still contains extra characters:

>>> print(o.query)
r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278
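
The extra characters in the query string can be removed by decoding it before parsing a second time. The following is a minimal sketch, assuming Python 3's urllib.parse and assuming the embedded URL always arrives in the r parameter, as it does in the question's input; the function name embedded_domain is hypothetical.

```python
from urllib.parse import urlparse, parse_qs

def embedded_domain(uri, param='r'):
    """Return the host of the URL stored in query parameter `param`, without 'www.'."""
    # parse_qs splits the query on '&' and percent-decodes each value.
    values = parse_qs(urlparse(uri).query).get(param)
    if not values:
        return None
    host = urlparse(values[0]).hostname  # None when the value has no network location
    if host is None:
        return None
    # Strip the prefix only when it is actually present.
    if host.startswith('www.'):
        host = host[len('www.'):]
    return host

s = ('/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/'
     'viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278')
print(embedded_domain(s))
```

For the question's input this should print some-super-domain.de. Parsing twice avoids hand-written string slicing: the first pass isolates the parameter, the second pass reads the host from the decoded URL.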

Answer 2

Try this code; it works:

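from urllib.parse import urlparse, unquote  # Python 2: urlparse.urlparse, urllib.unquote

url = '/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278'
url = url[url.find('http'):]  # drop everything before the embedded URL
url = unquote(url)            # decode %3A, %3F, %3D, %26
result = urlparse(url)
domain = result.netloc
# Strip 'www.' only when present; str.find() returns -1, never None,
# so the original comparison with None was always true.
if domain.startswith('www.'):
    domain = domain[4:]
print(domain)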

Answer 3

www\.(.*?)\/

This works. See the demo.

http://regex101.com/r/pP3pN1/18
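
To apply this pattern from Python, a minimal sketch could look as follows; the variable names are assumptions for illustration. Note that the pattern depends on www. being present, so the search result is checked before the group is read.

```python
import re

s = ('/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/'
     'viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278')

# The lazy group captures everything between 'www.' and the next '/'.
match = re.search(r'www\.(.*?)/', s)
domain = match.group(1) if match else None  # None when the input has no 'www.'
print(domain)
```

For the question's input this should print some-super-domain.de.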

Answer 4

You can try the following code, which uses a variable-length lookbehind (supported by the third-party regex module):

>>> import regex
>>> s = "/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278"
>>> m = regex.search(r'(?<=https?[^/]*//www\.)[^/]*', s).group()
>>> m
'some-super-domain.de'

or, with the standard re module:

>>> import re
>>> m = re.search(r'(?<=www\.)[^/]*', s).group()
>>> m
'some-super-domain.de'

Answer 5

from urllib.parse import urlparse, unquote  # Python 2: urlparse.urlparse, urllib.unquote

HTTP_PREFIX = 'http://'
URI = '/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278'

# Decode the percent-encoded URI
unquoted_uri = unquote(URI)

# Keep only the URL embedded in the query string
queryurl = HTTP_PREFIX + unquoted_uri.split(HTTP_PREFIX)[-1]

# Now you get the hostname you were looking for
parsed_hostname = urlparse(queryurl).netloc
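
One detail worth keeping in mind with netloc: it returns the whole network location, including any port, whereas the hostname attribute of the parse result returns only the host in lower case. A small sketch in Python 3 follows; the port 8080 is a hypothetical addition to the question's URL, used only to show the difference.

```python
from urllib.parse import urlparse

# Hypothetical example URL: the question's host with an assumed port added.
example = 'http://www.some-super-domain.de:8080/forum/viewtopic.php'
parsed = urlparse(example)

print(parsed.netloc)    # www.some-super-domain.de:8080
print(parsed.hostname)  # www.some-super-domain.de
```

Either value still carries the www. prefix, so it has to be stripped separately if the bare domain is wanted.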

Python: can urlparse be used to parse the domain from a cgi-bin URL?
