Regular expression does not match a URL in Python

Asked: 2013-01-09 02:22:41

Tags: python regex

  

Possible duplicate:
  how to extract domain name from URL

I want to extract the site from a URL, i.e. console.aws.amazon.com from the URL below:

>>> ts
'https://console.aws.amazon.com/ec2/home?region=us-east-1#s=Instances,EC2 Management Console,12/3/2012 4:34:57 PM,11,0,,25806'
>>> re.match(ts,'(")?http(s)?://(.*?)/').group(0)

Traceback (most recent call last):
  File "<pyshell#17>", line 1, in <module>
    re.match(ts,'(")?http(s)?://(.*?)/').group(0)
AttributeError: 'NoneType' object has no attribute 'group'

I tried this regular expression in JS and it works. Any idea why it matches in JS but does not work in Python?

3 answers:

Answer 0 (score: 5)

Your call to match is incorrect: the arguments are in the wrong order. The Python docs say:

re.match(pattern, string, flags=0)

You are doing:

re.match(string, pattern)

So just change it to:

 re.match('(")?http(s)?://(.*?)/', ts).group(0)
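To make the effect of the argument order concrete, here is a minimal sketch using the URL from the question. The helper name extract_host is hypothetical and not part of the original answer. Note that group(0) returns the whole matched prefix, including the scheme and the trailing slash, while the host itself is in the third capture group.

```python
import re

ts = ('https://console.aws.amazon.com/ec2/home?region=us-east-1'
      '#s=Instances,EC2 Management Console,12/3/2012 4:34:57 PM,11,0,,25806')

# The pattern goes first, the string to search goes second.
pattern = r'(")?http(s)?://(.*?)/'


def extract_host(url):
    """Hypothetical helper: return the host part, or None if nothing matches."""
    m = re.match(pattern, url)
    # re.match returns None on failure, so check before calling .group().
    return m.group(3) if m else None


print(re.match(pattern, ts).group(0))  # whole match: scheme, host, trailing slash
print(extract_host(ts))                # only the third capture group: the host

# With the arguments swapped, the URL is compiled as the pattern and the
# pattern text is used as the subject, so nothing matches and None comes back.
print(re.match(ts, pattern))
```

Checking for None before calling .group() avoids the AttributeError shown in the question whenever the input is not an http(s) URL.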

Answer 1 (score: 5)

Use urlparse:

>>> from urlparse import urlparse
>>> u = 'https://console.aws.amazon.com/ec2/home?region=us-east-1#s=Instances,EC2 Management Console,12/3/2012 4:34:57 PM,11,0,,25806'
>>> p = urlparse(u)
>>> p
ParseResult(scheme='https', netloc='console.aws.amazon.com', path='/ec2/home', params='', query='region=us-east-1', fragment='s=Instances,EC2 Management Console,12/3/2012 4:34:57 PM,11,0,,25806')
>>> p.netloc
'console.aws.amazon.com'
>>> 
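The session above uses the Python 2 module name. In Python 3 the same function lives in urllib.parse. The following is a small sketch under that assumption; the second URL (with a port and mixed case) is an illustrative example that is not from the question. It shows the difference between netloc and hostname.

```python
from urllib.parse import urlparse  # Python 3 location of urlparse

u = ('https://console.aws.amazon.com/ec2/home?region=us-east-1'
     '#s=Instances,EC2 Management Console,12/3/2012 4:34:57 PM,11,0,,25806')

p = urlparse(u)
print(p.netloc)  # the network location part of the URL

# Illustrative URL (an assumption, not from the question): netloc keeps the
# port and the original case, while hostname is lowercased and has no port.
q = urlparse('https://Example.com:8443/path')
print(q.netloc)
print(q.hostname)
print(q.port)
```

If only the bare host name is wanted, hostname is usually the more convenient attribute, since netloc may also carry a port or user information.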

Answer 2 (score: 0)

You can always use the str.partition method:

>>> print(ts.partition('//')[2].partition('/')[0])
console.aws.amazon.com

A regular expression is a bit overkill for this.
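As a rough sketch of how the partition approach behaves on different inputs, the snippet below wraps it in a hypothetical helper, host_by_partition. The extra URLs are illustrative assumptions. str.partition always returns a 3-tuple, so a string without '//' yields an empty result rather than an error.

```python
ts = ('https://console.aws.amazon.com/ec2/home?region=us-east-1'
      '#s=Instances,EC2 Management Console,12/3/2012 4:34:57 PM,11,0,,25806')


def host_by_partition(url):
    """Hypothetical helper: text between the first '//' and the next '/'."""
    # partition returns (before, separator, after); index 2 is what follows '//'.
    after_scheme = url.partition('//')[2]
    # Index 0 is everything before the first '/', i.e. the host (and any port).
    return after_scheme.partition('/')[0]


print(host_by_partition(ts))
print(host_by_partition('https://example.com'))        # no trailing slash needed
print(host_by_partition('https://example.com:8443/x')) # any port is kept
print(repr(host_by_partition('example.com/path')))     # no '//': empty string
```

This keeps the one-liner's simplicity, but it assumes the input contains a scheme separator; for inputs that may lack one, a URL parser is the safer choice.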