I have the following input string:
/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278
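I want to use urlparse() to get the domain, but in this case the netloc attribute comes back as an empty string.
How can I extract the domain name (ideally without the www)?
Desired output: some-super-domain.de
Note: sometimes the input string above has no www!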
Answer 0 (score: 1)
I don't think urlparse gives you what you want here; you can use a regular expression instead:
import re
# s is the input string from the question
m = re.search(r'(?<=www\.)[a-zA-Z\-]+\.[a-zA-Z]+', s)
print(m.group(0))
Result:
some-super-domain.de
So, if you use urlparse, the result is:
s='/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278'
from urlparse import urlparse  # Python 2; on Python 3: from urllib.parse import urlparse
o = urlparse(s)
print(o)
Result:
ParseResult(scheme='', netloc='', path='/cgi-bin/ivw/CP/dbb_ug_sp', params='', query='r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278', fragment='')
So in this result you can reach the domain through o.query, but that is not what you want, because it contains extra characters:
>>> print(o.query)
r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278
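Since the target URL sits percent-encoded inside the r query parameter, one alternative is to let the standard library decode the query string and then parse the inner URL. The following is only a minimal sketch: it assumes Python 3 (where urlparse lives in urllib.parse) and that the parameter is always named r. The helper name extract_domain is hypothetical, not part of any library.

```python
from urllib.parse import urlparse, parse_qs

def extract_domain(s, param="r"):
    """Return the host of the URL stored in query parameter `param`,
    without a leading 'www.', or None if it cannot be found."""
    # The outer string has no scheme or netloc; everything useful is in .query
    query = urlparse(s).query
    # parse_qs splits on '&' and percent-decodes each value
    values = parse_qs(query).get(param)
    if not values:
        return None
    host = urlparse(values[0]).hostname
    if host is None:
        return None
    # Strip the prefix only when it is actually present
    if host.startswith("www."):
        host = host[len("www."):]
    return host

s = ('/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/'
     'viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278')
print(extract_domain(s))
```

Going through parse_qs keeps the d=... parameter of the outer string out of the inner URL, which plain slicing from 'http' onwards does not.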
Answer 1 (score: 1)
Try this code; it works:
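from urlparse import urlparse  # Python 2; on Python 3 use urllib.parse
import urllib

url = '/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278'
# Keep only the part that starts at the embedded URL
url = url[url.find('http'):]
# Decode the percent-escapes (%3A -> ':', %3F -> '?', ...)
url = urllib.unquote(url).decode('utf8')
result = urlparse(url)
domain = result.netloc
# str.find() never returns None, so test for the prefix explicitly
if domain.startswith('www.'):
    domain = domain[4:]
print(domain)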
Answer 2 (score: 0)
Answer 3 (score: 0)
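You can try the following code, which uses a variable-length lookbehind (this requires the third-party regex module, since the standard re module does not support it):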
>>> import regex
>>> s = "/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278"
>>> m = regex.search(r'(?<=https?[^/]*//www\.)[^/]*', s).group()
>>> m
'some-super-domain.de'
Or, with the standard re module:
>>> import re
>>> m = re.search(r'(?<=www\.)[^/]*', s).group()
>>> m
'some-super-domain.de'
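Both patterns above depend on www. being present, which the question says is not always the case. A minimal sketch using only the standard re module follows. It assumes the embedded URL always starts with http or https, and that the colon after the scheme may appear either literally or as %3A. The names DOMAIN_RE and find_domain are hypothetical.

```python
import re

# Scheme, then ':' (literal or percent-encoded), '//', an optional 'www.',
# then the host up to the next '/', '%', '?', '&', '#' or ':'.
DOMAIN_RE = re.compile(r'https?(?::|%3A)//(?:www\.)?([^/%?&#:]+)', re.IGNORECASE)

def find_domain(s):
    """Return the captured host, or None when no embedded URL is found."""
    m = DOMAIN_RE.search(s)
    return m.group(1) if m else None

with_www = '/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/viewtopic.php'
without_www = '/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//some-super-domain.de/forum/viewtopic.php'
print(find_domain(with_www))
print(find_domain(without_www))
```

Using a capture group instead of a lookbehind avoids the fixed-width restriction of re, so the optional www. can be part of the match.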
Answer 4 (score: 0)
import urlparse
import urllib
HTTP_PREFIX = 'http://'
URI = '/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278'
# Unquote the HTTP quoted URI
unquoted_uri = urllib.unquote(URI)
# Parse the URI to get just the URL in the query
queryurl = HTTP_PREFIX + unquoted_uri.split(HTTP_PREFIX)[-1]
# Now you get the hostname you were looking for
parsed_hostname = urlparse.urlparse(queryurl).netloc
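For the sample input, parsed_hostname above still carries the www. prefix. Below is a sketch of the same unquote-and-split steps with the prefix removed only when it is present. It assumes Python 3 (the answer itself targets Python 2) and an embedded URL that uses http://. The function name hostname_from_uri is hypothetical.

```python
from urllib.parse import unquote, urlparse

HTTP_PREFIX = 'http://'

def hostname_from_uri(uri):
    """Unquote the URI, isolate the embedded URL and return its host without 'www.'."""
    unquoted = unquote(uri)
    # Everything after the last 'http://' is taken as the embedded URL
    query_url = HTTP_PREFIX + unquoted.split(HTTP_PREFIX)[-1]
    host = urlparse(query_url).netloc
    return host[4:] if host.startswith('www.') else host

URI = ('/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/'
       'viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278')
print(hostname_from_uri(URI))
```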