Question

I am trying to figure out how to find subdirectories on a web page using BeautifulSoup in Python. I have an idea of how to do this. Here is what I have so far:

from bs4 import BeautifulSoup

html = '''<a href="/images/pic.png">images</a>
<a href="google.com">google</a>'''

soup = BeautifulSoup(html, 'html.parser')
links = soup.find_all('a', href=True)  # only <a> tags that have an href attribute
for link in links:
    print(link['href'])

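The code above prints every link on the page. How can I print only the subdirectories, such as "/images/pic.png"?

I am open to using any other module, but a solution with BeautifulSoup would be nice.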

Answer 1

Add an if condition on a['href']. For example, assuming a subdirectory has at least two / characters in its path, you can use a['href'].count('/') >= 2 as the condition.

Sample:

from bs4 import BeautifulSoup

html = '''<a href="/images/pic.png">images</a>
<a href="google.com">google</a>'''

soup = BeautifulSoup(html, 'html.parser')
links = soup.find_all('a', href=True)
for a in links:
    if a['href'].count('/') >= 2:
        print(a['href'])

If by "subdirectories" you mean relative paths, you can use a['href'].startswith('/') as the condition.
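
To see how the two conditions differ, here is a minimal sketch that applies both to a few plain href strings, without BeautifulSoup. Only the first two hrefs come from the question; the other two, and the helper names has_two_slashes and is_root_relative, are made up for illustration.

```python
# Hypothetical sample hrefs; only the first two come from the question.
hrefs = ["/images/pic.png", "google.com", "/about", "http://example.com/"]


def has_two_slashes(href):
    # Condition 1: at least two '/' characters anywhere in the href.
    return href.count("/") >= 2


def is_root_relative(href):
    # Condition 2: the href starts with '/'.
    return href.startswith("/")


by_count = [h for h in hrefs if has_two_slashes(h)]
by_prefix = [h for h in hrefs if is_root_relative(h)]

print(by_count)   # ['/images/pic.png', 'http://example.com/']
print(by_prefix)  # ['/images/pic.png', '/about']
```

The count-based check also matches absolute URLs such as http://example.com/, because the "//" after the scheme already contributes two slashes. The prefix check matches /about, which has no directory below the root. Which one fits depends on what "subdirectory" is supposed to mean for the page being scraped.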

Answer 2

You can parse the URL to extract the directory path:

import posixpath
from urllib.parse import urlparse  # Python 3; on Python 2 use `from urlparse import urlparse`

from bs4 import BeautifulSoup

html = '<a href="/images/pic.png">images</a><a href="google.com">google</a>'
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a', href=True):
    # Directory part of the URL path, e.g. '/images' for '/images/pic.png'
    dirpath = posixpath.dirname(urlparse(a['href']).path)
    if dirpath and dirpath != '/':
        print(dirpath)  # NOTE: urllib.parse.unquote_plus() may introduce `/`
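
To make the parsing step easier to inspect, the sketch below runs the same dirname-of-path logic on a few href strings using only the standard library. It assumes Python 3, where urlparse lives in urllib.parse. The helper name dir_of and the last two sample hrefs are hypothetical and not part of the answer above.

```python
import posixpath
from urllib.parse import urlparse


def dir_of(href):
    # Hypothetical helper: return the directory part of the URL path,
    # or None when the path has no directory other than the root.
    dirpath = posixpath.dirname(urlparse(href).path)
    return dirpath if dirpath and dirpath != "/" else None


# The first two hrefs come from the question; the rest are illustrative.
samples = [
    "/images/pic.png",
    "google.com",
    "http://example.com/a/b/c.html?x=1",
    "/index.html",
]
results = {href: dir_of(href) for href in samples}

for href, dirpath in results.items():
    print(href, "->", dirpath)
```

Because urlparse separates the path from the scheme, host, and query string, an absolute URL such as http://example.com/a/b/c.html?x=1 yields /a/b, while google.com and /index.html yield nothing and would be skipped.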

Finding subdirectories with BeautifulSoup
