I have a regular expression that searches for the "href" attribute in tags, but it currently does not give me what I want:
<a[^>]* href="([^"]*)"
Given this input:
<a href="http://something" title="Development of the Python language and website">Core Development</a>
it matches this part:
<a href="http://something"
But I only need to find:
http://something
Answer 0 (score: 7)
This works for me; you can check the working demo yourself.
matches = re.findall(r'<a[^>]* href="([^"]*)"', html)
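The pattern from the question already captures the URL; what differs is whether you read the whole match or only the first group. A minimal sketch, assuming the sample line from the question is stored in a variable named `html` (the variable names here are illustrative):

```python
import re

# Sample input taken from the question.
html = ('<a href="http://something" '
        'title="Development of the Python language and website">'
        'Core Development</a>')

pattern = re.compile(r'<a[^>]* href="([^"]*)"')

m = pattern.search(html)
whole_match = m.group(0)  # the entire matched text, including '<a href="'
url = m.group(1)          # only the text captured by the parentheses

# When the pattern has exactly one capture group, findall returns
# the group contents rather than the whole match.
urls = pattern.findall(html)

print(whole_match)
print(url)
print(urls)
```

So `group(0)` reproduces the unwanted result shown in the question, while `group(1)` or `findall` yields only the URL.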
That said, I would use Beautiful Soup to accomplish this instead...
from bs4 import BeautifulSoup
html = '''
<a href="http://something" title="Development of the Python language and website">Core Development</a>
<a href="http://something.com" title="Development of the Python language and website">Core Development</a>
'''
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
for a in soup.find_all('a', href=True):
    print(a['href'])
Note: if you are using an older version of Beautiful Soup, use the following instead:
for a in soup.findAll('a', href=True):
Answer 1 (score: 3)
Try this:
re.findall(r'(?<=<a href=")[^"]*',yourStr)
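One caveat with the lookbehind is that it only matches when `href` is the first attribute, written exactly as `<a href="`. A small sketch of that behaviour; the second sample string is made up for illustration:

```python
import re

pattern = re.compile(r'(?<=<a href=")[^"]*')

# href directly follows '<a ' (the case from the question).
first_attr = '<a href="http://something" title="x">Core Development</a>'
# Hypothetical variant where another attribute comes before href.
later_attr = '<a class="nav" href="http://something">Core Development</a>'

found_first = pattern.findall(first_attr)
found_later = pattern.findall(later_attr)

print(found_first)
print(found_later)
```

The second call finds nothing, because the text before the URL is not literally `<a href="`.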
Answer 2 (score: 1)
Rather than reinventing the wheel, you can use http://www.crummy.com/software/BeautifulSoup/
$ sudo pip install beautifulsoup4
$ python
>>> html_doc = """
... <html><head><title>The Dormouse's story</title></head>
... <body>
... <p class="title"><b>The Dormouse's story</b></p>
...
... <p class="story">Once upon a time there were three little sisters; and their names were
... <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
... <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
... <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
... and they lived at the bottom of a well.</p>
...
... <p class="story">...</p>
... """
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, 'html.parser')
>>> href = [i.get('href') for i in soup.find_all('a')]
>>> href
['http://example.com/elsie', 'http://example.com/lacie', 'http://example.com/tillie']
If you cannot install the beautifulsoup package, you can simply download the older version from http://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.2.1.tar.gz
$ wget http://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.2.1.tar.gz
$ tar xvzf BeautifulSoup-3.2.1.tar.gz
$ cp BeautifulSoup-3.2.1/BeautifulSoup.py .
$ python
>>> import BeautifulSoup
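If no third-party package can be installed at all, another option (not part of this answer, and assuming Python 3) is the standard library's `html.parser`. The following is only a sketch; the class name `HrefCollector` and the shortened sample document are illustrative:

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)


# Shortened version of the sample document used above.
html_doc = '''
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> and
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>.</p>
'''

collector = HrefCollector()
collector.feed(html_doc)
collector.close()
print(collector.hrefs)
```

This avoids the download step entirely, at the cost of writing the small collector class yourself.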
Answer 3 (score: 0)
You can also use (http[s]?:[^"\s]*)
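Note that this pattern looks for anything URL-shaped, not specifically for `href` values. A short sketch of what that means in practice; the `title` text containing a second URL is invented for illustration:

```python
import re

url_pattern = re.compile(r'(http[s]?:[^"\s]*)')

# Hypothetical input: a second URL appears inside the title attribute.
html = ('<a href="http://something" '
        'title="see https://other.example">Core Development</a>')

urls = url_pattern.findall(html)
print(urls)
```

Both URLs are returned, so this approach fits best when every URL in the text is wanted, wherever it appears.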
Answer 4 (score: 0)
You can try the match method in the re module, then use group to select your match:
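import re
str1 = '''<a href="http://something" title="Development of the Python language and website">Core Development</a>'''
# [^>]* and [^"]* stay inside the tag and the attribute value; the original
# greedy .* could run past both, and it required a space after the closing quote
pattern = re.compile(r'<a[^>]*href="([^"]*)"')
m = pattern.match(str1)  # match() only succeeds at the start of the string
if m:
    match = m.group(1)
    print(match)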
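Because `match` is anchored at the start of the string, it returns `None` as soon as the tag is preceded by any other text. A sketch of the difference from `search`, using a made-up input where the link sits inside a paragraph:

```python
import re

pattern = re.compile(r'<a[^>]*\shref="([^"]*)"')

# Hypothetical input: the <a> tag is not at the very start of the string.
text = '<p>See <a href="http://something" title="Development">Core Development</a></p>'

anchored = pattern.match(text)   # only tries position 0, so no match here
anywhere = pattern.search(text)  # scans forward for the first match

# Guard against None before calling group().
url = anywhere.group(1) if anywhere else None

print(anchored)
print(url)
```

For whole documents, `search` or `findall` is therefore usually the safer choice, with a `None` check before calling `group`.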