Question

I have a string:

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"

I want to get this result:

[('https', 'dbwebb.se', ''), ('ftp', 'bth.com', '32'), ('file', 'localhost', '8585'), ('http', 'v2-dbwebb.se', '')]

I tried:

match = re.findall("([fh]t*ps?|file):[\\/]*(.*?)(:\d+|(?=[\\\/]))", line)

But instead I get:

[["https", "dbwebb.se", ""], ["ftp", "bth.com", ":32"], ["file", "localhost", ":8585"], ["http", "v2-dbwebb.se", ""]]

There is one difference: I get ":32" and ":8585". How can I get "32" and "8585" instead, without the silly ":"? Thanks!

Answer 1

I suggest

import re
line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"
match = re.findall(r"([fh]t*ps?|file)://([^/]*?)(?::(\d+))?(?:/|$)", line)
print(match)

See the Python demo.

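The main point is the (?::(\d+))?(?:/|$) part, where the : and the 1+ digits are optional ((?:...)? matches 1 or 0 times), and (?:/|$) matches a / or the end of the string. Because the colon sits outside the capturing group (\d+), only the digits end up in the result.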

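Details

- ([fh]t*ps?|file) - Group 1 (the first item in the tuple): either of
  - [fh]t*ps? - f or h, zero or more t, then p, and 1 or 0 s
  - | - or
  - file - a literal file substring
- :// - a literal substring
- ([^/]*?) - Group 2 (the second item in the tuple): any 0 or more characters other than /, as few as possible
- (?::(\d+))? - an optional sequence:
  - : - a colon
  - (\d+) - Group 3 (the third item in the tuple): one or more digits
- (?:/|$) - a / or the end of the string.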
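
As a quick check of the breakdown above, here is a minimal sketch that writes the same pattern in re.VERBOSE form so that each piece carries its own comment. The name URL_RE is only an illustrative choice and is not part of the original answer.

```python
import re

# Same pattern as in the answer, spelled out with re.VERBOSE.
# URL_RE is a hypothetical name used only for this sketch.
URL_RE = re.compile(r"""
    ([fh]t*ps?|file)   # group 1: the scheme
    ://                # literal separator
    ([^/]*?)           # group 2: host, as few non-slash characters as possible
    (?: : (\d+) )?     # optional colon followed by group 3: the port digits
    (?: / | $ )        # a slash or the end of the string
""", re.VERBOSE)

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"

matches = URL_RE.findall(line)
print(matches)
# [('https', 'dbwebb.se', ''), ('ftp', 'bth.com', '32'), ('file', 'localhost', '8585'), ('http', 'v2-dbwebb.se', '')]
```

With three capturing groups, re.findall returns one 3-tuple per match, and a group that did not take part in the match (the missing port) shows up as an empty string.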

Answer 2

A regular expression is not a good tool for parsing URLs. There is a dedicated library for this complex task, urllib:

from urllib.parse import urlparse

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"

result = []
for i in line.split(', '):
    o = urlparse(i)
    result.append([o.scheme, o.hostname, o.port])
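
One detail worth knowing about this approach: the port attribute is an int, or None when the URL has no port, rather than a string. A minimal sketch, assuming a shortened two-URL sample string (sample, parts and ports are names made up for this illustration):

```python
from urllib.parse import urlparse

# Shortened sample, assumed here only to keep the illustration small.
sample = "ftp://bth.com:32/files/im.jpeg, http://v2-dbwebb.se/do%hack"

parts = [urlparse(u) for u in sample.split(", ")]
ports = [p.port for p in parts]
print(ports)  # [32, None]
```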

Answer 3

Instead of a regular expression, why not split on ', ' and then use Python's urllib.parse.urlparse, for example:

from urllib.parse import urlparse

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"
output = [urlparse(url) for url in line.split(', ')]

This gives you:

[ParseResult(scheme='https', netloc='dbwebb.se', path='/kunskap/uml', params='', query='', fragment='sequence'),
 ParseResult(scheme='ftp', netloc='bth.com:32', path='/files/im.jpeg', params='', query='', fragment=''),
 ParseResult(scheme='file', netloc='localhost:8585', path='/zipit', params='', query='', fragment=''),
 ParseResult(scheme='http', netloc='v2-dbwebb.se', path='/do%hack', params='', query='', fragment='')]

Then pick out the elements you want:

wanted = [(url.scheme, url.hostname, url.port or '') for url in output]

Which gives you:

[('https', 'dbwebb.se', ''),
 ('ftp', 'bth.com', 32),
 ('file', 'localhost', 8585),
 ('http', 'v2-dbwebb.se', '')]
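
Note that the ports here are integers, while the output asked for in the question has them as strings. If the exact string form matters, one possible sketch is the following; the helper name scheme_host_port is hypothetical and the input is assumed to be separated by ', ' as in the question.

```python
from urllib.parse import urlparse

def scheme_host_port(text):
    """Return (scheme, host, port) tuples, with the port as a string ('' if absent)."""
    result = []
    for url in text.split(", "):
        parsed = urlparse(url)
        # parsed.port is an int, or None when the URL carries no port
        port = "" if parsed.port is None else str(parsed.port)
        result.append((parsed.scheme, parsed.hostname, port))
    return result

line = "https://dbwebb.se/kunskap/uml#sequence, ftp://bth.com:32/files/im.jpeg, file://localhost:8585/zipit, http://v2-dbwebb.se/do%hack"
print(scheme_host_port(line))
# [('https', 'dbwebb.se', ''), ('ftp', 'bth.com', '32'), ('file', 'localhost', '8585'), ('http', 'v2-dbwebb.se', '')]
```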
