BeautifulSoup: finding links that match a criterion

Time: 2014-12-24 07:02:10

Tags: python html parsing beautifulsoup html-parsing

I am trying to collect all the links on a web page, which I parsed with Beautiful Soup, whose URLs contain /d2l/lp/ouHome/home.d2l?ou=.

The actual links look like this:

"http://learn.ou.edu/d2l/lp/ouHome/home.d2l?ou=1234567"
"http://learn.ou.edu/d2l/lp/ouHome/home.d2l?ou=1234561"
"http://learn.ou.edu/d2l/lp/ouHome/home.d2l?ou=1234564"
"http://learn.ou.edu/d2l/lp/ouHome/home.d2l?ou=1234562"
"http://learn.ou.edu/d2l/lp/ouHome/home.d2l?ou=1234563"

1 answer:

Answer 0 (score: 0)

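You can pass a compiled regular expression as the value of the href argument to find_all(). Beautiful Soup then keeps only the <a> tags whose href attribute matches the pattern: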

soup.find_all('a', href=re.compile(r'/d2l/lp/ouHome/home\.d2l\?ou=\d+'))
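Beautiful Soup applies a regular expression filter with its search() method, so the pattern does not need to cover the scheme and host at the start of the URL. The following is a minimal sketch of that matching behaviour using only the re module. Apart from the first URL, the href values in the list are made up for illustration.

```python
import re

# Same pattern as in the answer. The dot and the question mark are escaped
# because they are regex metacharacters; \d+ requires at least one digit.
pattern = re.compile(r'/d2l/lp/ouHome/home\.d2l\?ou=\d+')

# Hypothetical href values; only the first one comes from the question.
hrefs = [
    "http://learn.ou.edu/d2l/lp/ouHome/home.d2l?ou=1234567",  # absolute URL
    "http://learn.ou.edu/d2l/home",                           # different path
    "/d2l/lp/ouHome/home.d2l?ou=42",                          # relative URL
    "http://learn.ou.edu/d2l/lp/ouHome/home.d2l?ou=",         # no digits after ou=
]

# search() looks for the pattern anywhere in the string, unlike match(),
# which only matches at the beginning.
matching = [h for h in hrefs if pattern.search(h)]
print(matching)
```

Only the first and third entries are kept: the second has a different path, and the fourth has no digits after ou=.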

Demo:

>>> import re
>>> from bs4 import BeautifulSoup
>>> 
>>> data = """
... <div>
...     <a href="http://learn.ou.edu/d2l/lp/ouHome/home.d2l?ou=1234567">link1</a>
...     <a href="http://learn.ou.edu/d2l/lp/ouHome/home.d2l?ou=1234561">link2</a>
...     <a href="http://learn.ou.edu/d2l/lp/ouHome/home.d2l?ou=1234564">link3</a>
...     <a href="http://learn.ou.edu/d2l/lp/ouHome/home.d2l?ou=1234562">link4</a>
...     <a href="http://learn.ou.edu/d2l/lp/ouHome/home.d2l?ou=1234563">link5</a>
... </div>
... """
>>> 
>>> soup = BeautifulSoup(data, 'html.parser')
>>> links = soup.find_all('a', href=re.compile(r'/d2l/lp/ouHome/home\.d2l\?ou=\d+'))
>>> for link in links:
...     print(link.text)
... 
link1
link2
link3
link4
link5
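
The question asks for the links themselves rather than the link text, so here is a hedged sketch of one way to pull out the href values and the ou identifiers. It assumes Python 3 and the built-in 'html.parser' backend. The HTML snippet, the variable names, and the use of urllib.parse to read the ou parameter are illustrative choices, not part of the original answer.

```python
import re
from urllib.parse import urlparse, parse_qs

from bs4 import BeautifulSoup

# Hypothetical markup: two course links from the question plus one
# unrelated link that should be filtered out.
html = """
<div>
    <a href="http://learn.ou.edu/d2l/lp/ouHome/home.d2l?ou=1234567">link1</a>
    <a href="http://learn.ou.edu/d2l/lp/ouHome/home.d2l?ou=1234561">link2</a>
    <a href="http://learn.ou.edu/about">other</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r'/d2l/lp/ouHome/home\.d2l\?ou=\d+')

# find_all() returns Tag objects; index each tag to read its href attribute.
urls = [a["href"] for a in soup.find_all("a", href=pattern)]

# parse_qs() maps each query parameter to a list of values; take the first.
ou_ids = [parse_qs(urlparse(u).query)["ou"][0] for u in urls]

print(urls)
print(ou_ids)
```

Reading the attribute with a["href"] is safe here because the href filter only returns tags that have a matching href.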