Suppose I have two types of links in an HTML file. I want to pick out all the links of type 1. How can I do this in Python with the re module?
Type 1:
http://www.domain.com/firstlevel/02-02-13/secondlevel-slug.html
Type 2:
http://www.domain.com/levelone/02-02-13/secondlevel-slug.html
I want to find all links that contain both firstlevel and secondlevel.
Here is what I tried:
import re
text = "here goes the code with various links of type 1 and type 2…"
findURL = re.findall('.*firstlevel.*secondlevel.*',text)
Here is what I think the regular expression means:
find all strings that have ONE OR MORE occurrences of ANY CHARACTER
followed by the word firstlevel
followed by ONE OR MORE occurrences of ANY CHARACTER
followed by the word secondlevel
followed by ONE OR MORE occurrences of ANY CHARACTER
But I get an empty list.
What am I doing wrong?
Answer 0 (score: 1)
You have to pin down where the link starts and where it ends, i.e.:
findURL = re.findall(r'http:.*firstlevel.*secondlevel.*\.html', text)
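One caveat worth checking, sketched here under the assumption that several links can sit on the same line: `.*` is greedy and also matches spaces, so a single match may run from the first `http:` all the way to the last `.html`. The sample `text` below is made up for illustration, and the `\S*` variant is just one possible way to keep a match inside a single link.

```python
import re

# Hypothetical sample text: a type 2 link followed by a type 1 link on one line.
text = (
    "a http://www.domain.com/levelone/02-02-13/secondlevel-slug.html "
    "b http://www.domain.com/firstlevel/02-02-13/secondlevel-slug.html c"
)

# Greedy '.*' can cross spaces, so the match starts at the first 'http:'
# and swallows both links as one string.
greedy = re.findall(r'http:.*firstlevel.*secondlevel.*\.html', text)

# '\S*' only matches non-whitespace, so each match stays within one link.
strict = re.findall(r'http://\S*firstlevel\S*secondlevel\S*\.html', text)

print(greedy)
print(strict)
```

With this sample, `greedy` holds one long string that begins at the levelone link, while `strict` holds only the firstlevel link.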
Answer 1 (score: 0)
>>> import re
>>> p = re.compile(r'(http://\S+firstlevel\S+secondlevel\S+\.html)')
>>> text = 'random text http://www.domain.com/firstlevel/02-02-13/secondlevel-slug.html more random text http://www.domain.com/levelone/02-02-13/secondlevel-slug.html'
>>> i = p.finditer(text)
>>> for m in i:
...     print(m.group())
...
http://www.domain.com/firstlevel/02-02-13/secondlevel-slug.html
>>>
HTH.
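Since the question mentions an HTML file, here is a rough script-style variation of the same idea, not the exact pattern from the session above. It assumes the links appear inside quoted `href` attributes; the `html` string, the `LINK` name and the `type1_links` name are all made up for illustration. The character class excludes quotes and angle brackets so a match should not run past the end of an attribute value.

```python
import re

# Hypothetical HTML snippet; in practice this would be the contents of the file.
html = (
    '<a href="http://www.domain.com/firstlevel/02-02-13/secondlevel-slug.html">one</a>'
    '<a href="http://www.domain.com/levelone/02-02-13/secondlevel-slug.html">two</a>'
)

# [^\s"'<>] excludes whitespace, quotes and angle brackets, so a match
# cannot extend beyond the href value it started in.
LINK = re.compile(
    r'''http://[^\s"'<>]*firstlevel[^\s"'<>]*secondlevel[^\s"'<>]*\.html'''
)

# findall returns the full match for each hit because the pattern has no groups.
type1_links = LINK.findall(html)
print(type1_links)
```

For anything beyond a quick filter, an HTML parser is usually a sturdier way to collect the `href` values before applying the regular expression.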