正则表达式匹配一组字符,除非其中一个字符在python中是最后一个

时间:2018-06-11 00:47:10

标签: regex python-3.x

我正在尝试从文章中提取网站名称,但有时候句子末尾的域名会有一个不受欢迎的点字符,例如“你可以在www.website.gov.us找到更多信息。去年我们......“

我想获取域名并避免包含最后一个点字符。我在python中的当前正则表达式是:

Regex = r'[www.\w\.]+'

5 个答案:

答案 0 :(得分:2)

感谢大家的回答。这个正则表达式解决了我的问题:

Regex = r'https?://(?:ww\w\.)?([a-zA-Z\d-]+(?:\.[a-zA-Z\d-]+)+)'

答案 1 :(得分:1)

您可以使用以下正则表达式:(?:http://|www.)[^\' ]+

这会获得包含dot character的网站地址,然后使用rstrip('.')将其删除。

"www.website.gov.us.".rstrip('.') => "www.website.gov.us"

答案 2 :(得分:1)

好消息:不需要在python中进行黑客攻击,你可以在正则表达式中完成所有操作!

r'www(\.[a-z]+)+'

首先,匹配'www',然后查找点后跟字母的重复模式。如果您的网址中可能包含大写字母,请将“[a-z]”更改为“[a-zA-Z]”。

答案 3 :(得分:0)

您可以使用以下怪异的正则表达式:

\b((?:https?://)?(?:(?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6})|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|(?:(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])))(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])?(?:/[\w\.-]*)*/?)\b

Demo regex101

此正则表达式将接受以下格式的网址:

<强> INPUT:

add1 http://mit.edu.com abc
add2 https://facebook.jp.com.2. abc
add3 www.google.be. uvw
add4 https://www.google.be. 123
add5 www.website.gov.us test2
Hey bob on www.test.com. 
another test with ipv4 http://192.168.1.1/test.jpg. toto2
website with different port number www.test.com:8080/test.jpg not port 80
www.website.gov.us/login.html
test with ipv4 192.168.1.1/test.jpg.
search at google.co.jp/maps.
test with ipv6 2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg.

<强>输出:

http://mit.edu.com
https://facebook.jp.com
www.google.be
https://www.google.be
www.website.gov.us
www.test.com
http://192.168.1.1/test.jpg
www.test.com:8080/test.jpg
www.website.gov.us/login.html
192.168.1.1/test.jpg
google.co.jp/maps
2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg

<强>说明:

  • \b用于字边界以界定网址和文本的其余部分
  • (?:https?://)?匹配http://或https //(如果存在)
  • (?:(?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6})匹配标准网址(可能以www.开头(让我们称之为STANDARD_URL
  • (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)以匹配标准Ipv4(让我们称之为IPv4
  • 匹配IPv6网址:(?:(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9]))(我们称之为IPv6
  • 匹配端口部分(我们称之为PORT)如果存在:(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])
  • 匹配网址的(?:/[\w\.-]*)*/?)目标对象部分(html文件,jpg,...)(让我们称之为RESSOURCE_PATH

这提供了以下正则表达式

\b((?:https?://)?(?:STANDARD_URL|IPv4|IPv6)(?:PORT)?(?:RESSOURCE_PATH)\b

<强>来源:

  

IPv6 Regular expression that matches valid IPv6 addresses

     

IPv4 https://www.safaribooksonline.com/library/view/regular-expressions-cookbook/9780596802837/ch07s16.html

     

PORT https://stackoverflow.com/a/12968117/8794221

     

其他来源:   https://code.tutsplus.com/tutorials/8-regular-expressions-you-should-know--net-6149

$ more url.py

import re

inputString = """add1 http://mit.edu.com abc
add2 https://facebook.jp.com.2. abc
add3 www.google.be. uvw
add4 https://www.google.be. 123
add5 www.website.gov.us test2
Hey bob on www.test.com. 
another test with ipv4 http://192.168.1.1/test.jpg. toto2
website with different port number www.test.com:8080/test.jpg not port 80
www.website.gov.us/login.html
test with ipv4 (192.168.1.1/test.jpg).
search at google.co.jp/maps.
test with ipv6 2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg."""

regex=ur"\b((?:https?://)?(?:(?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6})|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|(?:(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])))(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])?(?:/[\w\.-]*)*/?)\b"

matches = re.findall(regex, inputString)
print(matches)

<强>输出:

$ python url.py 
['http://mit.edu.com', 'https://facebook.jp.com', 'www.google.be', 'https://www.google.be', 'www.website.gov.us', 'www.test.com', 'http://192.168.1.1/test.jpg', 'www.test.com:8080/test.jpg', 'www.website.gov.us/login.html', '192.168.1.1/test.jpg', 'google.co.jp/maps', '2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg']

答案 4 :(得分:-1)

使用您当前的正则表达式,您可以这样做:

Regex = r'[www.\w\.]+[^\.]'

它会起作用。 但这不是最好的正则表达式。 Ryan的建议更好,并且会解决它,你也可以将[^\.]添加到它的结尾并跳过rstrip调用。