A perfect regular expression for extracting URLs with re.findall()

Date: 2015-03-03 20:01:46

Tags: python regex python-3.x

I am using regular expressions to extract URLs, but on one example they either fail to match or the Python interpreter just hangs.

The URL is 'http://www.computerworld.ru/articles/Naslednik-Hadoop-uskoryaet-evolyutsiyu-analiza-dannyh'

1 answer:

Answer 0: (score: 0)

A regex for Python that works with re.findall:

http[s]?:\/\/(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
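To show how this pattern behaves with re.findall, here is a minimal sketch. The surrounding sentence in `text` is a made-up sample (only the URL comes from the question), and the names `URL_PATTERN` and `text` are just illustrative.

```python
import re

# Pattern copied from the answer. The escaped slashes (\/) are not required
# in Python, but they are harmless.
URL_PATTERN = re.compile(
    r"http[s]?:\/\/(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)

# Hypothetical sample text wrapped around the URL from the question.
text = (
    "See http://www.computerworld.ru/articles/"
    "Naslednik-Hadoop-uskoryaet-evolyutsiyu-analiza-dannyh for details"
)

# The pattern has no capturing groups, so findall returns the full matches.
urls = URL_PATTERN.findall(text)
print(urls)
```

The match stops at the first space, because a space is not in any of the character classes.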

If you need a capturing group:

(http[s]?:\/\/(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)
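As a rough illustration of what the capturing group changes: with exactly one group, re.findall returns the text of that group, which here is still the whole URL. The input string and the names `CAPTURING`, `found` and `spans` are assumptions made up for this sketch.

```python
import re

# Capturing variant from the answer: the whole URL is group 1.
CAPTURING = re.compile(
    r"(http[s]?:\/\/(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)"
)

# Hypothetical input containing two URLs.
text = "a http://example.com/x b https://example.org/y%20z c"

# With a single capturing group, findall returns a list of group-1 strings.
found = CAPTURING.findall(text)

# finditer yields match objects, so the position of each URL is available too.
spans = [(m.group(1), m.start(1)) for m in CAPTURING.finditer(text)]

print(found)
print(spans)
```

If you only need the matched strings, the non-capturing version gives the same list; the group mostly matters when you work with match objects or add further groups.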


http matches the characters http literally (case sensitive)
[s]? match a single character present in the list
Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
s the literal character s (case sensitive)
: matches the character : literally
\/ matches the character / literally
\/ matches the character / literally
(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+ Non-capturing group
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
1st Alternative: [a-zA-Z]
[a-zA-Z] match a single character present in the list below
a-z a single character in the range between a and z (case sensitive)
A-Z a single character in the range between A and Z (case sensitive)
2nd Alternative: [0-9]
[0-9] match a single character present in the list below
0-9 a single character in the range between 0 and 9
3rd Alternative: [$-_@.&+]
[$-_@.&+] match a single character present in the list below
$-_ a single character in the range between $ and _ (ASCII 36 to 95, which also covers digits, uppercase letters and characters such as % - . / : = ?)
@.&+ a single character in the list @.&+ literally (case sensitive)
4th Alternative: [!*\(\),]
[!*\(\),] match a single character present in the list below
!* a single character in the list !* literally
\( matches the character ( literally
\) matches the character ) literally
, the literal character ,
5th Alternative: (?:%[0-9a-fA-F][0-9a-fA-F])
(?:%[0-9a-fA-F][0-9a-fA-F]) Non-capturing group
% matches the character % literally
[0-9a-fA-F] match a single character present in the list below
0-9 a single character in the range between 0 and 9
a-f a single character in the range between a and f (case sensitive)
A-F a single character in the range between A and F (case sensitive)
[0-9a-fA-F] match a single character present in the list below
0-9 a single character in the range between 0 and 9
a-f a single character in the range between a and f (case sensitive)
A-F a single character in the range between A and F (case sensitive)
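
The breakdown above says that `$-_` inside the 3rd alternative is a range, not three literal characters. A small sketch to check what that range really accepts, and one side effect of it, follows. The sample sentence and the names `third_alt`, `accepted` and `trailing` are assumptions for illustration only.

```python
import re

# The 3rd alternative on its own.
third_alt = re.compile(r"[$-_@.&+]")

# Collect every printable ASCII character that this class accepts.
accepted = "".join(
    chr(code) for code in range(32, 127) if third_alt.fullmatch(chr(code))
)
print(accepted)

# Full pattern from the answer, used to show a side effect of the wide range.
URL_PATTERN = re.compile(
    r"http[s]?:\/\/(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)

# Hypothetical sentence that ends with a period right after the URL.
trailing = URL_PATTERN.findall("Visit http://example.com/page.")
print(trailing)
```

Because `.` and `,` fall inside the accepted set, punctuation that directly follows a URL in running text ends up in the match. Characters outside the classes, such as `#` and `~`, end the match instead, so you may want to adjust the classes for your data.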