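I am trying to collect all the links on a web page, which I am parsing with Beautiful Soup, whose URL contains /d2l/lp/ouHome/home.d2l?ou=.
The actual links look like this: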
"http://learn.ou.edu/d2l/lp/ouHome/home.d2l?ou=1234567"
"http://learn.ou.edu/d2l/lp/ouHome/home.d2l?ou=1234561"
"http://learn.ou.edu/d2l/lp/ouHome/home.d2l?ou=1234564"
"http://learn.ou.edu/d2l/lp/ouHome/home.d2l?ou=1234562"
"http://learn.ou.edu/d2l/lp/ouHome/home.d2l?ou=1234563"
Answer 0 (score: 0)
You can pass a compiled regular expression as the value of the href argument to find_all():
soup.find_all('a', href=re.compile(r'/d2l/lp/ouHome/home\.d2l\?ou=\d+'))
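The pattern above starts at /d2l/, yet it still matches absolute URLs that begin with http://learn.ou.edu. As far as I know, that is because Beautiful Soup tests a compiled pattern against the attribute value with its search() method, which looks anywhere in the string, rather than match(), which is anchored at the start. The following is a minimal sketch of that difference using only the re module; the second URL in the list is a made-up non-matching link added for illustration and is not from the question.

```python
import re

# Same pattern as in the answer; note it does not include the scheme or host.
pattern = re.compile(r'/d2l/lp/ouHome/home\.d2l\?ou=\d+')

hrefs = [
    "http://learn.ou.edu/d2l/lp/ouHome/home.d2l?ou=1234567",  # from the question
    "http://learn.ou.edu/d2l/home",  # hypothetical link that should be skipped
]

# search() scans the whole string, so the absolute URL is accepted.
matched = [h for h in hrefs if pattern.search(h)]

# match() only succeeds at position 0, so nothing is accepted here.
anchored = [h for h in hrefs if pattern.match(h)]

print(matched)
print(anchored)
```

If you ever need the pattern to apply to the whole href rather than a substring, add the scheme and host to the pattern or anchor it explicitly.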
Demo:
>>> import re
>>> from bs4 import BeautifulSoup
>>>
>>> data = """
... <div>
... <a href="http://learn.ou.edu/d2l/lp/ouHome/home.d2l?ou=1234567">link1</a>
... <a href="http://learn.ou.edu/d2l/lp/ouHome/home.d2l?ou=1234561">link2</a>
... <a href="http://learn.ou.edu/d2l/lp/ouHome/home.d2l?ou=1234564">link3</a>
... <a href="http://learn.ou.edu/d2l/lp/ouHome/home.d2l?ou=1234562">link4</a>
... <a href="http://learn.ou.edu/d2l/lp/ouHome/home.d2l?ou=1234563">link5</a>
... </div>
... """
>>>
>>> soup = BeautifulSoup(data, "html.parser")
>>> links = soup.find_all('a', href=re.compile(r'/d2l/lp/ouHome/home\.d2l\?ou=\d+'))
>>> for link in links:
...     print(link.text)
...
link1
link2
link3
link4
link5
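The demo prints the link text, but the question asks for the links themselves. Each matched tag exposes its URL as link['href'], and if you also want the numeric ou value, the standard library can parse it out of the query string. Below is a rough sketch of that second step; extract_ou_ids is a hypothetical helper name, and the input list stands in for [link['href'] for link in links] so the snippet runs without Beautiful Soup installed.

```python
from urllib.parse import urlparse, parse_qs


def extract_ou_ids(hrefs):
    """Return the value of the `ou` query parameter for each URL that has one."""
    ids = []
    for href in hrefs:
        # parse_qs maps each parameter name to a list of its values.
        query = parse_qs(urlparse(href).query)
        if "ou" in query:
            ids.append(query["ou"][0])
    return ids


# Stand-in for: hrefs = [link['href'] for link in links]
hrefs = [
    "http://learn.ou.edu/d2l/lp/ouHome/home.d2l?ou=1234567",
    "http://learn.ou.edu/d2l/lp/ouHome/home.d2l?ou=1234561",
]

ou_ids = extract_ou_ids(hrefs)
print(ou_ids)
```

The values come back as strings; convert them with int() if you need numbers.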