Question

I have a regular expression that looks for the "href" attribute in <a> tags, but it isn't doing quite what I want:

<a[^>]* href="([^"]*)"

From this input:

<a href="http://something" title="Development of the Python language and website">Core Development</a>

it matches this:

<a href="http://something"

But I only need to get:

http://something

Answer 1

This works for me. You can check the working demo yourself.

matches = re.findall(r'<a[^>]* href="([^"]*)"', html)
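A minimal sketch of why the pattern from the question already gives the wanted result when used with re.findall: when the pattern contains exactly one capturing group, findall returns the contents of that group rather than the whole match. The variable name html and the sample string are taken from the question; the use of re.search to show the full match is an addition for illustration.

```python
import re

html = '<a href="http://something" title="Development of the Python language and website">Core Development</a>'

# With a single capturing group, findall returns only what the group captured.
matches = re.findall(r'<a[^>]* href="([^"]*)"', html)
print(matches)

# The full match (what the question saw) is group(0); the URL alone is group(1).
m = re.search(r'<a[^>]* href="([^"]*)"', html)
print(m.group(0))
print(m.group(1))
```

So the longer text in the question is the whole match, while the capture group holds just the URL.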

That said, I would use Beautiful Soup for this instead:

from bs4 import BeautifulSoup

html = '''
<a href="http://something" title="Development of the Python language and website">Core Development</a>
<a href="http://something.com" title="Development of the Python language and website">Core Development</a>
'''

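# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')

# href=True keeps only the <a> tags that actually have an href attribute
for a in soup.find_all('a', href=True):
    print(a['href'])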

Note: if you are using an older version of Beautiful Soup (version 3), use this instead:

for a in soup.findAll('a', href=True):

Answer 2

Try this:

re.findall(r'(?<=<a href=")[^"]*',yourStr)
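A short sketch of how the lookbehind version behaves, under the assumption that the input may also contain tags where href is not the first attribute. The sample string is made up for illustration. Because the lookbehind requires the literal text `<a href="` immediately before the URL, tags with another attribute in front of href are skipped, whereas the capture-group pattern from the question still finds them.

```python
import re

# Hypothetical sample input: the second tag has a class attribute before href.
yourStr = (
    '<a href="http://example.com/a">A</a> '
    '<a class="x" href="http://example.com/b">B</a>'
)

# Lookbehind: the match must be directly preceded by '<a href="'.
lookbehind = re.findall(r'(?<=<a href=")[^"]*', yourStr)
print(lookbehind)

# Capture group: allows other attributes between '<a' and 'href'.
grouped = re.findall(r'<a[^>]* href="([^"]*)"', yourStr)
print(grouped)
```

If every tag in your input has href as its first attribute, the two approaches give the same list.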

Answer 3

Rather than reinventing the wheel, you can use Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/

$ sudo pip install beautifulsoup4
$ python
>>> html_doc = """
... <html><head><title>The Dormouse's story</title></head>
... <body>
... <p class="title"><b>The Dormouse's story</b></p>
... 
... <p class="story">Once upon a time there were three little sisters; and their names were
... <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
... <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
... <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
... and they lived at the bottom of a well.</p>
... 
... <p class="story">...</p>
... """
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, 'html.parser')
>>> href = [i.get('href') for i in soup.find_all('a')]
>>> href
['http://example.com/elsie', 'http://example.com/lacie', 'http://example.com/tillie']

If you would rather not install the beautifulsoup4 package, you can simply download the older version from http://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.2.1.tar.gz

$ wget http://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.2.1.tar.gz
$ tar xvzf BeautifulSoup-3.2.1.tar.gz
$ cp BeautifulSoup-3.2.1/BeautifulSoup.py .
$ python
>>> import BeautifulSoup

Answer 4

You can also use (http[s]?:[^"\s]*)
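A quick sketch of that pattern in use, with a made-up sample string. Note what it does and does not do: it matches anything that starts with http: or https: and runs up to the next double quote or whitespace, so it picks up absolute URLs anywhere in the text (not only inside href) and skips relative links.

```python
import re

# Hypothetical sample input with an absolute link, a relative link, and a bare URL in text.
html = (
    '<a href="http://something" title="x">Core Development</a>\n'
    '<a href="/relative/path">Relative</a>\n'
    '<p>See http://plain.example for details</p>'
)

urls = re.findall(r'(http[s]?:[^"\s]*)', html)
print(urls)
```

Here the relative href is not returned, while the URL in the paragraph text is.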

Answer 5

You can try the match method in the re module, and then use group to select your match:

    import re

    str1 = '''<a href="http://something" title="Development of the Python language and website">Core Development</a>'''
    # Lazy .*? and [^"]* stop the capture at the closing quote of href
    pattern = re.compile(r'<a.*?href="([^"]*)"')
    m = pattern.match(str1)
    if m:
        match = m.group(1)
        print(match)
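One caveat worth sketching: match only succeeds when the pattern matches at the very start of the string, so it returns None as soon as any text precedes the tag. The sample string and the tightened pattern below are assumptions for illustration; search scans the whole string instead.

```python
import re

# Hypothetical input where the tag is not at the start of the string.
text = 'Intro text <a href="http://something" title="t">Core Development</a>'
pattern = re.compile(r'<a[^>]*?href="([^"]*)"')

# match() is anchored at position 0, so it fails here.
print(pattern.match(text))

# search() looks for the first match anywhere in the string.
m = pattern.search(text)
if m is not None:
    print(m.group(1))
```

For more than one link per string, pattern.findall(text) returns all captured URLs as a list.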

Regular expression to find all "href" values in <a> tags
