I am using the following code to retrieve a bunch of links with BeautifulSoup. It returns all of the links, but I want to get the third link, parse that link, then get the third link from that page, and so on. How can I modify the code below to achieve this?
import urllib
from BeautifulSoup import *

url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print tag.get('href', None)
    print tag.contents[0]
Answer 0 (score: 2)
First of all, you should stop using BeautifulSoup version 3 - it is old and no longer maintained. Switch to BeautifulSoup version 4. Install it with:

pip install beautifulsoup4

and change the import to:

from bs4 import BeautifulSoup

Then, you need to use find_all() and take the 3rd link by index repeatedly, until there is no 3rd link on the page. Here is one way to do it:
import urllib
from bs4 import BeautifulSoup

url = raw_input('Enter - ')
while True:
    html = urllib.urlopen(url)
    soup = BeautifulSoup(html, "html.parser")
    try:
        url = soup.find_all('a')[2]["href"]
        # if the link is not absolute, you might need urljoin() here
    except IndexError:
        break  # could not get the 3rd link - exiting the loop
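As the comment in the loop notes, pages often contain relative links, and urllib cannot open those directly; the standard library's urljoin() resolves them against the URL of the page they came from. A minimal sketch of how it behaves (Python 3 shown; in Python 2 the same function lives in the urlparse module, and the URLs here are made up for illustration):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = "http://example.com/dir/page.html"

# A relative link is resolved against the directory of the base URL
print(urljoin(base, "other.html"))          # http://example.com/dir/other.html

# A root-relative link replaces the whole path
print(urljoin(base, "/root.html"))          # http://example.com/root.html

# An already-absolute link is returned unchanged
print(urljoin(base, "http://other.org/x"))  # http://other.org/x
```

In the loop above, you would apply it as `url = urljoin(url, soup.find_all('a')[2]["href"])` so that both absolute and relative hrefs produce a fetchable URL.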
Answer 1 (score: 0)
Another option is to use a CSS selector with nth-of-type to get the third anchor, looping until the CSS selection returns None:
If you want to find the third anchor that has an href attribute, you can use:

import urllib
from bs4 import BeautifulSoup

url = raw_input('Enter - ')
html = urllib.urlopen(url)
soup = BeautifulSoup(html, "html.parser")

a = soup.select_one("a:nth-of-type(3)")
while a:
    html = urllib.urlopen(a["href"])
    soup = BeautifulSoup(html, "html.parser")
    a = soup.select_one("a:nth-of-type(3)")
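Note that `a:nth-of-type(3)` matches an anchor that is the third `a` among its siblings, which is not always the same element as `find_all('a')[2]` (the third anchor in the whole document). The "third link overall" extraction can be illustrated without any third-party dependency using the standard library's html.parser; this is only a sketch, with a made-up page string and class name, not the bs4 approach from the answers:

```python
from html.parser import HTMLParser

class ThirdLinkFinder(HTMLParser):
    """Collects href values from <a> tags in document order."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)

    def third(self):
        # Return the 3rd href in the document, or None if fewer than 3 exist
        return self.hrefs[2] if len(self.hrefs) >= 3 else None

page = '<p><a href="/one">1</a></p><a href="/two">2</a><a href="/three">3</a>'
parser = ThirdLinkFinder()
parser.feed(page)
print(parser.third())  # /three
```

The same idea drives both answers: fetch a page, pull out the third href, fetch that, and repeat until a page has fewer than three links.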