I have HTML containing two lines that are far apart from each other, as shown below. Note that both lines begin with the same string.
<a href="http://example.com">file 1</a><a href="http://example.com/right2">file 2</a><a href="http://example.com/right3">file 3</a>
<a href="http://example.com">file 1</a><a href="http://example.com/left2">file 2</a><a href="http://example.com/left3">file 3</a>
I want the regular expression to give me only the results from the first line above, i.e.
http://example.com
http://example.com/right2
http://example.com/right3
file 1
file 2
file 3
If I use this regular expression
re.compile('<a href="(.+?)">(.+?)</a>').findall()
then I get
http://example.com
http://example.com/right2
http://example.com/right3
http://example.com
http://example.com/left2
http://example.com/left3
file 1
file 2
file 3
file 1
file 2
file 3
Please help. Thanks.
Answer 0 (score: 0)
Keep track of the href values you have seen, and stop as soon as you encounter one that repeats:
>>> import re
>>> matches = re.findall('<a href="(.+?)">(.+?)</a>', html_string)
>>> seen = set()
>>> for href, text in matches:
...     if href in seen:
...         break
...     seen.add(href)
...     print('{} {}'.format(href, text))
...
http://example.com file 1
http://example.com/right2 file 2
http://example.com/right3 file 3
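The same approach with BeautifulSoup instead of a regular expression:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_string, 'html.parser')  # name the parser explicitly
seen = set()
for tag in soup.select('a[href]'):
    # Stop at the first href that has already been seen.
    if tag['href'] in seen:
        break
    seen.add(tag['href'])
    print('{} {}'.format(tag['href'], tag.text))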
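For reference, the stop-at-first-repeat idea can be wrapped in a small function and run against the sample HTML from the question. This is only a sketch: the helper name links_until_repeat and the constant LINK_RE are made up for illustration, and html_string is assumed to hold the two lines exactly as posted.

```python
import re

# Pattern taken from the question; LINK_RE is a hypothetical name.
LINK_RE = re.compile(r'<a href="(.+?)">(.+?)</a>')


def links_until_repeat(html):
    """Return (href, text) pairs, stopping at the first href already seen."""
    seen = set()
    result = []
    for href, text in LINK_RE.findall(html):
        if href in seen:
            break  # the second line starts here, so stop collecting
        seen.add(href)
        result.append((href, text))
    return result


# The two lines from the question, joined by a newline (assumed layout).
html_string = (
    '<a href="http://example.com">file 1</a>'
    '<a href="http://example.com/right2">file 2</a>'
    '<a href="http://example.com/right3">file 3</a>\n'
    '<a href="http://example.com">file 1</a>'
    '<a href="http://example.com/left2">file 2</a>'
    '<a href="http://example.com/left3">file 3</a>'
)

links = links_until_repeat(html_string)
hrefs = [href for href, _ in links]
texts = [text for _, text in links]

for href in hrefs:
    print(href)
for text in texts:
    print(text)
```

Note that this stops at any repeated href, so it only works if the links within the first line are themselves unique.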