Question

Is it possible to get only specific URLs?

For example:

<a href="http://www.iwashere.com/washere.html">next</a>
<span class="class">...</span>
<a href="http://www.heelo.com/hello.html">next</a>
<span class="class">...</span>
<a href="http://www.iwashere.com/wasnot.html">next</a>
<span class="class">...</span>

The output should contain only the URLs from http://www.iwashere.com/

That is, the output URLs would be:

http://www.iwashere.com/washere.html
http://www.iwashere.com/wasnot.html

I did it with string logic. Is there a direct way to do this with BeautifulSoup?

Answer 1

You can match on several aspects of an element, including using a regular expression as an attribute value:

import re
soup.find_all('a', href=re.compile(r'^http://www\.iwashere\.com/'))

which matches (for example):

[<a href="http://www.iwashere.com/washere.html">next</a>, <a href="http://www.iwashere.com/wasnot.html">next</a>]

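that is, every <a> tag with an href attribute whose value starts with the string http://www.iwashere.com/.

You can loop over the results and pick out just the href attribute: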

>>> for elem in soup.find_all('a', href=re.compile(r'^http://www\.iwashere\.com/')):
...     print(elem['href'])
... 
http://www.iwashere.com/washere.html
http://www.iwashere.com/wasnot.html
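
The snippets above assume a soup object already exists. As a minimal end-to-end sketch, using the HTML from the question and assuming the built-in html.parser backend (any installed parser should behave the same here), the hrefs can be collected into a list; the names html, prefix and urls are only illustrative:

```python
import re
from bs4 import BeautifulSoup

# Sample markup taken from the question.
html = """\
<a href="http://www.iwashere.com/washere.html">next</a>
<span class="class">...</span>
<a href="http://www.heelo.com/hello.html">next</a>
<span class="class">...</span>
<a href="http://www.iwashere.com/wasnot.html">next</a>
<span class="class">...</span>
"""

# Assumption: html.parser is used because it ships with Python.
soup = BeautifulSoup(html, "html.parser")

# The ^ anchor restricts the match to values that start with the prefix;
# without it the pattern could match anywhere inside the href.
prefix = re.compile(r"^http://www\.iwashere\.com/")

# Collect only the href values of the matching <a> tags.
urls = [a["href"] for a in soup.find_all("a", href=prefix)]
print(urls)
```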

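To match all relative paths instead, use a negative lookahead assertion that tests that the value does not start with a scheme (e.g. http: or mailto:) or with a double slash (//hostname/path); any value that starts with neither must be a relative path: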

soup.find_all('a', href=re.compile(r'^(?!(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//))'))
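
To see what that pattern accepts and rejects, it can be tried on plain strings without BeautifulSoup at all. The sample values below are made up for illustration and are not from the question:

```python
import re

# Same pattern as above: reject values that start with "scheme:" or "//".
relative = re.compile(r'^(?!(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//))')

# Hypothetical href values covering each case.
samples = [
    "http://www.iwashere.com/washere.html",  # has a scheme
    "mailto:someone@example.com",            # has a scheme
    "//www.iwashere.com/path",               # protocol-relative
    "washere.html",                          # relative
    "/docs/index.html",                      # root-relative
    "../up/one.html",                        # relative
]

# Keep only the values the lookahead lets through.
relative_only = [s for s in samples if relative.search(s)]
print(relative_only)
```

Note that the pattern also lets an empty href through, so it may be worth filtering those out separately if they matter.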

Answer 2

If you are using BeautifulSoup 4.0.0 or later:

soup.select('a[href^="http://www.iwashere.com/"]')
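
select() returns a list of tags, so the URLs themselves still have to be read from each tag's href attribute. A small sketch, assuming a shortened version of the question's HTML and the built-in html.parser backend:

```python
from bs4 import BeautifulSoup

# Shortened sample markup based on the question.
html = """\
<a href="http://www.iwashere.com/washere.html">next</a>
<a href="http://www.heelo.com/hello.html">next</a>
<a href="http://www.iwashere.com/wasnot.html">next</a>
"""

soup = BeautifulSoup(html, "html.parser")

# [href^="..."] is the CSS "attribute starts with" selector.
selected = [a["href"] for a in soup.select('a[href^="http://www.iwashere.com/"]')]
print(selected)
```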

Answer 3

You can solve this with partial matching in gazpacho:

Input:

html = """\
<a href="http://www.iwashere.com/washere.html">next</a>
<span class="class">...</span>
<a href="http://www.heelo.com/hello.html">next</a>
<span class="class">...</span>
<a href="http://www.iwashere.com/wasnot.html">next</a>
<span class="class">...</span>
"""

Code:

from gazpacho import Soup

soup = Soup(html)
links = soup.find('a', {'href': "http://www.iwashere.com/"}, partial=True)
[link.attrs['href'] for link in links]

Which outputs:

# ['http://www.iwashere.com/washere.html', 'http://www.iwashere.com/wasnot.html']

Python BeautifulSoup: extracting specific URLs
