Question

Suppose I have two types of links in an HTML file. I want to filter out all links of type 1. How can I do this in Python using the re module?

Type 1:

http://www.domain.com/firstlevel/02-02-13/secondlevel-slug.html

Type 2:

http://www.domain.com/levelone/02-02-13/secondlevel-slug.html

I want to find all links that contain both firstlevel and secondlevel.

This is what I tried:

import re
text = "here goes the code with various links of type 1 and type 2…"
findURL = re.findall('.*firstlevel.*secondlevel.*',text)

Here is what I think the regular expression means:

find all strings that have ONE OR MORE occurrences of ANY CHARACTER
followed by the word firstlevel
followed by ONE OR MORE occurrences of ANY CHARACTER
followed by the word secondlevel
followed by ONE OR MORE occurrences of ANY CHARACTER

But I get an empty list.

What am I doing wrong?

Answer 1

You have to pin down where the link begins and ends, i.e.:

findURL = re.findall(r'http:.*firstlevel.*secondlevel.*\.html', text)
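One caveat, shown as a rough sketch rather than a definitive fix: `.*` means zero or more of any character except a newline, and it is greedy, so if several links share a line the match can run from the first `http:` to the last `.html`. The sample string and the names `greedy` and `bounded` below are made up for illustration; the second pattern assumes the URLs contain no whitespace.

```python
import re

# Hypothetical sample: both link types on one line, separated by plain text.
text = ("see http://www.domain.com/firstlevel/02-02-13/secondlevel-slug.html and "
        "http://www.domain.com/levelone/02-02-13/secondlevel-slug.html here")

# Greedy .* keeps going to the last ".html" on the line,
# so the single match swallows the type 2 link as well.
greedy = re.findall(r'http:.*firstlevel.*secondlevel.*\.html', text)

# \S+? cannot cross whitespace and stops at the first ".html",
# so only the type 1 link is returned.
bounded = re.findall(r'http://\S+?firstlevel\S+?secondlevel\S+?\.html', text)

print(greedy)
print(bounded)
```

If each link sits on its own line, the greedy pattern is usually good enough.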

Answer 2

>>> import re
>>> p = re.compile(r'(http://\S+firstlevel\S+secondlevel\S+\.html)')
>>> text = 'random text http://www.domain.com/firstlevel/02-02-13/secondlevel-slug.html more random text http://www.domain.com/levelone/02-02-13/secondlevel-slug.html'
>>> i = p.finditer(text)
>>> for m in i:
...    print(m.group())
...
http://www.domain.com/firstlevel/02-02-13/secondlevel-slug.html
>>>
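
If the two words vary, the same idea can be wrapped in a small helper. This is only a sketch: the function name `find_urls_with_words` and its parameters are hypothetical, and it assumes, as in the question, that the links start with `http://`, end in `.html`, and contain no whitespace.

```python
import re

def find_urls_with_words(text, first, second):
    """Return URLs in text that contain `first` and, later on, `second`."""
    # re.escape keeps characters such as "." or "-" in the words literal.
    pattern = re.compile(
        r'http://\S*?' + re.escape(first)
        + r'\S*?' + re.escape(second)
        + r'\S*?\.html'
    )
    # The pattern has no capture groups, so findall returns whole matches.
    return pattern.findall(text)

text = ('random text http://www.domain.com/firstlevel/02-02-13/secondlevel-slug.html '
        'more random text http://www.domain.com/levelone/02-02-13/secondlevel-slug.html')

type1 = find_urls_with_words(text, 'firstlevel', 'secondlevel')
type2 = find_urls_with_words(text, 'levelone', 'secondlevel')
print(type1)
print(type2)
```
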

HTH.

How do I use re in Python to find URLs that contain one word and another word?
