How do I capture only the right repeated groups with a regular expression?

Asked: 2014-10-05 16:54:10

Tags: python regex

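I have HTML that contains two lines, far apart from each other, as shown below. Note that both lines begin with an identical string (the same first link).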

<a href="http://example.com">file 1</a><a href="http://example.com/right2">file 2</a><a href="http://example.com/right3">file 3</a>

<a href="http://example.com">file 1</a><a href="http://example.com/left2">file 2</a><a href="http://example.com/left3">file 3</a>

I want the regular expression to give me only the results from the first line above, i.e.

http://example.com
http://example.com/right2
http://example.com/right3

file 1
file 2
file 3

If I use this regular expression

re.compile('<a href="(.+?)">(.+?)</a>').findall(html_string)

then I get the matches from both lines:

http://example.com
http://example.com/right2
http://example.com/right3
http://example.com
http://example.com/left2
http://example.com/left3

file 1
file 2
file 3
file 1
file 2
file 3

Please help. Thanks.

1 answer:

Answer 0: (score: 0)

Keep track of the href values you have already seen, and stop as soon as an href value repeats:

>>> import re
>>> matches = re.findall('<a href="(.+?)">(.+?)</a>', html_string)
>>> seen = set()
>>> for href, text in matches:
...     if href in seen:
...         break
...     seen.add(href)
...     print('{} {}'.format(href, text))
...
http://example.com file 1
http://example.com/right2 file 2
http://example.com/right3 file 3
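The loop above can also be wrapped in a small function that returns the URLs and the link texts as two separate lists, which is the shape the question asks for. This is only a sketch: the function name `links_until_repeat`, the constant `LINK_RE`, and the `<p>` filler between the two lines are assumptions made for illustration, and it relies on the first repeated href marking the start of the unwanted second line.

```python
import re

# Sample input modelled on the question: two runs of links that start
# with the same href, separated by unrelated markup (the <p> is made up).
html_string = (
    '<a href="http://example.com">file 1</a>'
    '<a href="http://example.com/right2">file 2</a>'
    '<a href="http://example.com/right3">file 3</a>\n'
    '<p>unrelated markup</p>\n'
    '<a href="http://example.com">file 1</a>'
    '<a href="http://example.com/left2">file 2</a>'
    '<a href="http://example.com/left3">file 3</a>'
)

# Same pattern as in the question, written as a raw string.
LINK_RE = re.compile(r'<a href="(.+?)">(.+?)</a>')


def links_until_repeat(html):
    """Return (href, text) pairs up to, but not including, the first repeated href."""
    seen = set()
    pairs = []
    for match in LINK_RE.finditer(html):
        href, text = match.groups()
        if href in seen:
            break  # the second run of links starts here
        seen.add(href)
        pairs.append((href, text))
    return pairs


pairs = links_until_repeat(html_string)
hrefs = [href for href, _ in pairs]
texts = [text for _, text in pairs]
print('\n'.join(hrefs))
print('\n'.join(texts))
```

Using `finditer` instead of `findall` means the scan stops at the first repeat without first building the full list of matches.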

The same approach using Beautiful Soup:

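from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which
# parsers happen to be installed.
soup = BeautifulSoup(html_string, 'html.parser')
seen = set()
# Visit every <a> that has an href attribute, in document order.
for tag in soup.select('a[href]'):
    if tag['href'] in seen:
        break  # first repeated href: the second run of links starts here
    seen.add(tag['href'])
    print('{} {}'.format(tag['href'], tag.text))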
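If installing Beautiful Soup is not an option, the same stop-at-first-repeat idea can probably be expressed with the standard library's `html.parser` module instead. This swaps in a different parser from the one the answer uses. The class name `FirstRunLinkParser`, its attributes, and the `<p>` filler in the sample input are hypothetical names chosen for this sketch.

```python
from html.parser import HTMLParser

# Sample input modelled on the question (the <p> filler is made up).
html_string = (
    '<a href="http://example.com">file 1</a>'
    '<a href="http://example.com/right2">file 2</a>'
    '<a href="http://example.com/right3">file 3</a>\n'
    '<p>unrelated markup</p>\n'
    '<a href="http://example.com">file 1</a>'
    '<a href="http://example.com/left2">file 2</a>'
    '<a href="http://example.com/left3">file 3</a>'
)


class FirstRunLinkParser(HTMLParser):
    """Collect (href, text) pairs until an href value repeats."""

    def __init__(self):
        super().__init__()
        self.links = []      # collected (href, text) pairs
        self._seen = set()   # href values encountered so far
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments inside the open <a>
        self._done = False   # set once a repeated href is found

    def handle_starttag(self, tag, attrs):
        if self._done or tag != 'a':
            return
        href = dict(attrs).get('href')
        if href is None:
            return
        if href in self._seen:
            self._done = True  # ignore everything from here on
            return
        self._seen.add(href)
        self._href = href
        self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text)))
            self._href = None


parser = FirstRunLinkParser()
parser.feed(html_string)
parser.close()
for href, text in parser.links:
    print('{} {}'.format(href, text))
```

Unlike the regular expression, this does not depend on `href` being the first attribute of the tag or on the exact spacing inside it.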