Python: Can urlparse extract the domain from a cgi-bin URL?

Date: 2014-09-03 09:06:10

Tags: python regex

I have the following input string:

/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278

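I want to get the domain with urlparse(), but in this case the netloc attribute comes back as an empty string.

How can I extract the domain name (ideally without www)?

Desired output: some-super-domain.de

Please note: the input string above sometimes has no www.

5 answers:

Answer 0: (score: 1)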

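I don't think urlparse gives you what you want here; a regex does:

import re
# s is the input string from the question (it is defined in the urlparse example further down)
m = re.search(r'(?<=www\.)[a-zA-Z\-]+\.[a-zA-Z]+', s)
print m.group(0)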

Result:

some-super-domain.de


If you do use urlparse, this is what you get:

s='/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278'

from urlparse import urlparse
o = urlparse(s)
print o

Result:

ParseResult(scheme='', netloc='', path='/cgi-bin/ivw/CP/dbb_ug_sp', params='', query='r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278', fragment='')

So with this result you can reach the domain through o.query, but that is not what you want, because it still contains extra characters:

>>> print o.query
r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278
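As a possible follow-up to the point above, here is a minimal sketch that lets the standard library decode the query string instead of reading o.query by hand. It assumes Python 3, where the old urlparse module lives in urllib.parse, and it assumes the embedded URL always arrives in the r parameter, as it does in the question's input.

```python
from urllib.parse import urlparse, parse_qs

s = ('/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/'
     'viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278')

# First pass: split the cgi-bin path from its query string.
outer = urlparse(s)

# parse_qs splits on '&' and percent-decodes each value.
params = parse_qs(outer.query)

# Assumption: the embedded URL is carried in the 'r' parameter.
inner_url = params['r'][0]

# Second pass: the decoded value is a full URL, so netloc is populated.
host = urlparse(inner_url).netloc
print(host)  # www.some-super-domain.de
```

The host still carries the www. prefix, so that has to be stripped separately when it is present.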

Answer 1: (score: 1)

Try this code; it works:

from urlparse import urlparse
import urllib

url = '/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278'
# Keep only the embedded URL, starting at its scheme
url = url[url.find('http'):]
# Decode the percent-escapes (%3A -> ':', %3F -> '?', ...)
url = urllib.unquote(url).decode('utf8')
result = urlparse(url)
domain = result.netloc
# Strip a leading 'www.' only when it is actually present
if domain.startswith('www.'):
    domain = domain[4:]
print(domain)
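Since the question says the www. prefix is sometimes missing, the prefix test deserves care: str.find returns -1 (never None) on a miss, so a check such as startswith is the safer way to decide whether to strip it. Below is a rough Python 3 sketch of the same approach wrapped in a function; inner_domain is a hypothetical helper name, and the second input is a made-up variant of the question's string without www.

```python
from urllib.parse import urlparse, unquote

def inner_domain(raw):
    """Return the host of the URL embedded in raw, without a leading 'www.'.

    Hypothetical helper: assumes the embedded URL starts at the first 'http'.
    """
    start = raw.find('http')
    if start == -1:          # no embedded URL at all
        return None
    host = urlparse(unquote(raw[start:])).netloc
    # Only drop the prefix when it is really there.
    return host[4:] if host.startswith('www.') else host

with_www = ('/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/'
            'viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278')
without_www = with_www.replace('www.', '')  # illustrative variant, not from the question

print(inner_domain(with_www))     # some-super-domain.de
print(inner_domain(without_www))  # some-super-domain.de
```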

Answer 2: (score: 0)

www\.(.*?)\/

This works. See the demo.

http://regex101.com/r/pP3pN1/18
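For anyone who wants to try the pattern outside the demo page, here is a small sketch of how it could be applied with Python's re module (Python 3 syntax assumed). The domain ends up in capture group 1. Note that the pattern depends on the literal www., so it finds nothing when the prefix is absent; the second input is an illustrative variant, not taken from the question.

```python
import re

s = ('/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/'
     'viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278')

# Lazy group: everything after 'www.' up to the next '/'.
m = re.search(r'www\.(.*?)/', s)
domain = m.group(1) if m else None
print(domain)  # some-super-domain.de

# Illustrative variant without 'www.': the pattern no longer matches.
m_no_www = re.search(r'www\.(.*?)/', s.replace('www.', ''))
print(m_no_www)  # None
```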

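Answer 3: (score: 0)

You can try the code below, which uses a variable-length lookbehind (supported by the third-party regex module, not by the standard re module):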

>>> import regex
>>> s = "/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278"
>>> m = regex.search(r'(?<=https?[^/]*//www\.)[^/]*', s).group()
>>> m
'some-super-domain.de'

>>> import re
>>> m = re.search(r'(?<=www\.)[^/]*', s).group()
>>> m
'some-super-domain.de'
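Both lookbehind patterns require www. to be present. One way to cover the case without it is to make the prefix optional and capture the host in a group instead. This is only a sketch under a few assumptions: Python 3, a scheme separator that is either a literal : or the encoded %3A, and a host that ends at the first /, ?, #, & or %. HOST_RE and embedded_host are hypothetical names.

```python
import re

# Scheme, ':' or its percent-encoded form, '//', optional 'www.', then the host.
HOST_RE = re.compile(r'https?(?::|%3A)//(?:www\.)?([^/?#&%]+)', re.IGNORECASE)

def embedded_host(text):
    """Return the first embedded host without 'www.', or None if none is found."""
    m = HOST_RE.search(text)
    return m.group(1) if m else None

s = ('/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/'
     'viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278')

print(embedded_host(s))                      # some-super-domain.de
print(embedded_host(s.replace('www.', '')))  # some-super-domain.de
```

A regex like this skips the decoding step entirely, which keeps it short but also means it will not validate that the captured text is a well-formed host.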

Answer 4: (score: 0)

import urlparse
import urllib

HTTP_PREFIX = 'http://'
URI = '/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278'

# Unquote the HTTP quoted URI
unquoted_uri = urllib.unquote(URI)

# Parse the URI to get just the URL in the query
queryurl = HTTP_PREFIX + unquoted_uri.split(HTTP_PREFIX)[-1]

# Now you get the hostname you were looking for (still including any leading 'www.')
parsed_hostname = urlparse.urlparse(queryurl).netloc