I am using the following code to retrieve a bunch of links with BeautifulSoup. It returns all of the links, but I want to get the third link, parse that link, then get the third link from that page, and so on. How can I modify the code below to achieve this?
import urllib
from BeautifulSoup import *

url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print tag.get('href', None)
    print tag.contents[0]
Answer 0 (score: 2)
First of all, you should stop using BeautifulSoup version 3 - it is old and no longer maintained. Switch to BeautifulSoup version 4. Install it with:

pip install beautifulsoup4

and change the import to:

from bs4 import BeautifulSoup

Then, you need to use find_all() and take the 3rd link by index repeatedly, until there is no 3rd link on the page. Here is one way to do it:
import urllib
from bs4 import BeautifulSoup

url = raw_input('Enter - ')
while True:
    html = urllib.urlopen(url)
    soup = BeautifulSoup(html, "html.parser")
    try:
        url = soup.find_all('a')[2]["href"]
        # if the link is not absolute, you might need urljoin() here
    except IndexError:
        break  # could not get the 3rd link - exiting the loop
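As the comment in the loop notes, pages often contain relative links, and urllib cannot open those directly; the standard library's urljoin() resolves them against the URL of the page they came from. A minimal sketch of how it behaves (Python 3 shown; in Python 2 the same function lives in the urlparse module, and the URLs here are made up for illustration):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = "http://example.com/dir/page.html"

# A relative link is resolved against the directory of the base URL
print(urljoin(base, "other.html"))          # http://example.com/dir/other.html

# A root-relative link replaces the whole path
print(urljoin(base, "/root.html"))          # http://example.com/root.html

# An already-absolute link is returned unchanged
print(urljoin(base, "http://other.org/x"))  # http://other.org/x
```

In the loop above, you would apply it as `url = urljoin(url, soup.find_all('a')[2]["href"])` so that both absolute and relative hrefs produce a fetchable URL.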
Answer 1 (score: 0)
Another option is to use a CSS selector with nth-of-type to get the third anchor, looping until the CSS selection returns None:
If you want to find the third anchor that has an href attribute, you can use:

import urllib
from bs4 import BeautifulSoup

url = raw_input('Enter - ')
html = urllib.urlopen(url)
soup = BeautifulSoup(html, "html.parser")

a = soup.select_one("a:nth-of-type(3)")
while a:
    html = urllib.urlopen(a["href"])
    soup = BeautifulSoup(html, "html.parser")
    a = soup.select_one("a:nth-of-type(3)")
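Note that `a:nth-of-type(3)` matches an anchor that is the third `a` among its siblings, which is not always the same element as `find_all('a')[2]` (the third anchor in the whole document). The "third link overall" extraction can be illustrated without any third-party dependency using the standard library's html.parser; this is only a sketch, with a made-up page string and class name, not the bs4 approach from the answers:

```python
from html.parser import HTMLParser

class ThirdLinkFinder(HTMLParser):
    """Collects href values from <a> tags in document order."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)

    def third(self):
        # Return the 3rd href in the document, or None if fewer than 3 exist
        return self.hrefs[2] if len(self.hrefs) >= 3 else None

page = '<p><a href="/one">1</a></p><a href="/two">2</a><a href="/three">3</a>'
parser = ThirdLinkFinder()
parser.feed(page)
print(parser.third())  # /three
```

The same idea drives both answers: fetch a page, pull out the third href, fetch that, and repeat until a page has fewer than three links.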