RegEx for matching a URL in Python

Time: 2019-05-11 22:01:20

Tags: python regex python-2.7 regex-group regex-greedy

I have this sample string:

line = '[text] something - https://www.myurl.com/test1/ lorem ipsum https://www.myurl.com/test2/ - https://www.myurl.com/test3/ marker needle - some more text at the end'

I need to extract the path that comes right before "marker needle" (without the slashes). The following lists all paths:

print re.findall('https://www\\.myurl\\.com/(.+?)/', line)
# ['test1', 'test2', 'test3']

However, when I change it to find only the path I want (the one before "marker needle"), it gives a weird output:

print re.findall('https://www\\.myurl\\.com/(.+?)/ marker needle', line)
# ['test1/ lorem ipsum https://www.myurl.com/test2/ - https://www.myurl.com/test3']

My expected output:

test3

I tried the same thing in other ways, but the result was the same.

1 answer:

Answer 0 (score: 2)

This expression has three capture groups, and the second one holds the desired output:

(https:\/\/www\.myurl\.com\/)([A-Za-z0-9-]+)(\/\smarker needle)
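To see why the pattern from the question overshoots, here is a minimal sketch. A lazy `.+?` can still match `/` and spaces, and because the regex engine keeps the leftmost starting point, the capture grows from the first URL until `/ marker needle` finally matches. The `[^/\s]+` class used below is an assumed alternative to the `[A-Za-z0-9-]+` class above; either one keeps the capture inside a single path segment.

```python
import re

line = ('[text] something - https://www.myurl.com/test1/ lorem ipsum '
        'https://www.myurl.com/test2/ - https://www.myurl.com/test3/ '
        'marker needle - some more text at the end')

# Lazy ".+?" may cross "/" and spaces: the match starts at the first URL
# and expands until "/ marker needle" can be matched.
lazy = re.findall(r'https://www\.myurl\.com/(.+?)/ marker needle', line)

# Excluding "/" and whitespace keeps the capture within one path segment,
# so only the URL directly before the marker can match.
strict = re.findall(r'https://www\.myurl\.com/([^/\s]+)/ marker needle', line)

print(lazy)    # one long string starting at "test1"
print(strict)  # ['test3']
```

The restricted class forces the engine to give up on the first two URLs, because neither is followed directly by `/ marker needle`.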

This tool can help us modify or change the expression.


RegEx descriptive graph

jex.im visualizes regular expressions.

Python test

# -*- coding: UTF-8 -*-
import re

string = "[text] something - https://www.myurl.com/test1/ lorem ipsum https://www.myurl.com/test2/ - https://www.myurl.com/test3/ marker needle - some more text at the end"
# Group 2 captures the path segment directly before "/ marker needle".
# The dots are escaped so that they match literal dots only.
expression = r'(https:\/\/www\.myurl\.com\/)([A-Za-z0-9-]+)(\/\smarker needle)'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(2) + "\" is a match  ")
else:
    print(' Sorry! No matches!')

Output

YAAAY! "test3" is a match 
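For reference, here is a small sketch of what each of the three groups holds, plus a variant with a single capture group so that `re.findall` returns the path directly. The names `text`, `three_groups` and `one_group` are illustrative and not from the original post.

```python
import re

text = ('[text] something - https://www.myurl.com/test1/ lorem ipsum '
        'https://www.myurl.com/test2/ - https://www.myurl.com/test3/ '
        'marker needle - some more text at the end')

# Same structure as the expression above: prefix, path, trailing marker.
three_groups = r'(https://www\.myurl\.com/)([A-Za-z0-9-]+)(/\smarker needle)'
# Only the path is captured, so findall yields plain strings instead of tuples.
one_group = r'https://www\.myurl\.com/([A-Za-z0-9-]+)/\smarker needle'

m = re.search(three_groups, text)
groups = m.groups() if m else ()
paths = re.findall(one_group, text)

print(groups)  # ('https://www.myurl.com/', 'test3', '/ marker needle')
print(paths)   # ['test3']
```

With a single group there is no need to remember which group index holds the path.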

Performance test

This JavaScript snippet reports the runtime of a for loop that runs the replacement one million times.

const repeat = 1000000; // one million iterations
const start = Date.now();
let match; // declared outside the loop so it is in scope for the log below

for (let i = repeat; i > 0; i--) {
	const regex = /(.*)(https:\/\/www\.myurl\.com\/)([A-Za-z0-9-]+)(\/\smarker needle)(.*)/gm;
	const str = "[text] something - https://www.myurl.com/test1/ lorem ipsum https://www.myurl.com/test2/ - https://www.myurl.com/test3/ marker needle - some more text at the end";
	const subst = `$3`;

	// Replace the whole line with the third group (the path before the marker).
	match = str.replace(regex, subst);
}

const end = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match  ");
console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test.  ");
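Since the question is about Python, a rough Python counterpart of this benchmark could use the standard `timeit` module. This is only a sketch under assumptions: the helper `extract`, the precompiled `pattern`, and the iteration count are illustrative choices, and the timing depends on the machine.

```python
import re
import timeit

text = ('[text] something - https://www.myurl.com/test1/ lorem ipsum '
        'https://www.myurl.com/test2/ - https://www.myurl.com/test3/ '
        'marker needle - some more text at the end')

# Compiled once, outside the timed function, so only the search is measured.
pattern = re.compile(r'https://www\.myurl\.com/([A-Za-z0-9-]+)/\smarker needle')


def extract():
    """Return the path right before the marker, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None


repeat = 10000  # arbitrary iteration count, chosen for illustration
elapsed = timeit.timeit(extract, number=repeat)

print('"%s" is a match' % extract())
print('%.3f s is the runtime of %d runs' % (elapsed, repeat))
```

Raising `repeat` gives a steadier measurement at the cost of a longer run.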