The following is a div tag taken directly from espncricinfo.com.
<div id="rectPlyr_Playerlistt20" style="display: none; visibility: hidden;
background:url(http://i.imgci.com/espncricinfo/ciPlayerTablebottom-bg.gif) bottom left no-repeat;">
<table class="playersTable" cellpadding="0" cellspacing="0" style="margin-top:15px; margin-bottom:14px;">
<td class="divider"><a href="/ci/content/player/26421.html">R Ashwin</a></td>
<td class="divider"><a href="/ci/content/player/27223.html">STR Binny</a></td>
<td class=""><a href="/ci/content/player/625383.html">JJ Bumrah</a></td>
</tr>
<tr class="odd">
<td class="divider"><a href="/ci/content/player/430246.html">YS Chahal</a></td>
<td class="divider"><a href="/ci/content/player/290727.html">R Dhawan</a></td>
<td class=""><a href="/ci/content/player/28235.html">S Dhawan</a></td>
</tr>
<tr class="">
<td class="divider"><a href="/ci/content/player/28081.html">MS Dhoni</a></td>
<td class="divider"><a href="/ci/content/player/28671.html">FY Fazal</a></td>
<td class=""><a href="/ci/content/player/28763.html">G Gambhir</a></td>
</tr>
<tr class="odd">
<td class="divider"><a href="/ci/content/player/234675.html">RA Jadeja</a></td>
<td class="divider"><a href="/ci/content/player/290716.html">KM Jadhav</a></td>
<td class=""><a href="/ci/content/player/253802.html">V Kohli</a></td>
</tr>
<tr class="">
<td class="divider"><a href="/ci/content/player/277955.html">DS Kulkarni</a></td>
<td class="divider"><a href="/ci/content/player/326016.html">B Kumar</a></td>
<td class=""><a href="/ci/content/player/398506.html">Mandeep Singh</a></td>
</tr>
<tr class="odd">
<td class="divider"><a href="/ci/content/player/31107.html">A Mishra</a></td>
<td class="divider"><a href="/ci/content/player/481896.html">Mohammed Shami</a></td>
<td class=""><a href="/ci/content/player/290630.html">MK Pandey</a></td>
</tr>
<tr class="">
<td class="divider"><a href="/ci/content/player/554691.html">AR Patel</a></td>
<td class="divider"><a href="/ci/content/player/32540.html">CA Pujara</a></td>
<td class=""><a href="/ci/content/player/277916.html">AM Rahane</a></td>
</tr>
<tr class="odd">
<td class="divider"><a href="/ci/content/player/422108.html">KL Rahul</a></td>
<td class="divider"><a href="/ci/content/player/33141.html">AT Rayudu</a></td>
<td class=""><a href="/ci/content/player/279810.html">WP Saha</a></td>
</tr>
<tr class="">
<td class="divider"><a href="/ci/content/player/236779.html">I Sharma</a></td>
<td class="divider"><a href="/ci/content/player/34102.html">RG Sharma</a></td>
<td class=""><a href="/ci/content/player/537126.html">BB Sran</a></td>
</tr>
<tr class="odd">
<td class="divider"><a href="/ci/content/player/390484.html">JD Unadkat</a></td>
<td class="divider"><a href="/ci/content/player/237095.html">M Vijay</a></td>
<td class=""><a href="/ci/content/player/376116.html">UT Yadav</a></td>
</tr>
<tr class="">
</tr>
</table>
</div>
I want to scrape the HTML above:
from bs4 import BeautifulSoup
import os
import urllib2
BASE_URL = "http://www.espncricinfo.com"
espn_ = urllib2.urlopen("http://www.espncricinfo.com/ci/content/player/index.html?country=6")
soup = BeautifulSoup(espn_ , 'html.parser')
#print soup.prettify().encode('utf-8')
t20 = soup.find_all('div' , {"id" : "rectPlyr_Playerlistt20"})
for row in t20:
    print(row.find('tr' , {"class":"odd"}))
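Assume the HTML above is what I fetch from the URL in the code. When I run the scraper, the output is None.
Even when I print t20 I do not get the full output: it only shows the rows up to JJ Bumrah, i.e. only the first <tr> tag.
If the data above is unclear, go to the URL passed to espn_, select the India team, and then the T20 tab. I want to scrape the href links of all the players listed under the T20 tab.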
Answer 0 (score: 1)
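The HTML is badly broken, which you can see just by looking at the first few rows of the table (the first row has no opening <tr>). Your best bet is to use lxml or html5lib as the parser, select the anchors directly, and take a step slice: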
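# .content assumes espn_ is a requests response; a urllib2 response has no
# such attribute, so pass espn_ itself in that case.
soup = BeautifulSoup(espn_.content , 'html5lib')
t20 = soup.select("#rectPlyr_Playerlistt20 .playersTable td.divider a")
for a in t20[1::2]:
    print(a)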
This gives you:
<a href="/ci/content/player/27223.html">STR Binny</a>
<a href="/ci/content/player/290727.html">R Dhawan</a>
<a href="/ci/content/player/28671.html">FY Fazal</a>
<a href="/ci/content/player/290716.html">KM Jadhav</a>
<a href="/ci/content/player/326016.html">B Kumar</a>
<a href="/ci/content/player/481896.html">Mohammed Shami</a>
<a href="/ci/content/player/32540.html">CA Pujara</a>
<a href="/ci/content/player/33141.html">AT Rayudu</a>
<a href="/ci/content/player/34102.html">RG Sharma</a>
<a href="/ci/content/player/237095.html">M Vijay</a>
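Since the question asks for the href of every player under the T20 tab, here is a minimal sketch of another way to collect them that does not depend on the row structure at all. It is not the approach above: it swaps BeautifulSoup for the standard library's html.parser.HTMLParser, which simply ignores the stray </tr>. The names PlayerLinkCollector and collect_player_links are hypothetical, and SAMPLE_HTML is a trimmed copy of the markup from the question plus one extra div added only to show that anchors outside the target div are skipped. For the real page you would feed it the downloaded page text instead.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://www.espncricinfo.com"

# Trimmed copy of the markup from the question, including the missing
# opening <tr> before the first row. The second div is an illustrative
# addition, not part of the original page excerpt.
SAMPLE_HTML = """
<div id="rectPlyr_Playerlistt20" style="display: none;">
<table class="playersTable" cellpadding="0" cellspacing="0">
<td class="divider"><a href="/ci/content/player/26421.html">R Ashwin</a></td>
<td class="divider"><a href="/ci/content/player/27223.html">STR Binny</a></td>
<td class=""><a href="/ci/content/player/625383.html">JJ Bumrah</a></td>
</tr>
<tr class="odd">
<td class="divider"><a href="/ci/content/player/430246.html">YS Chahal</a></td>
<td class="divider"><a href="/ci/content/player/290727.html">R Dhawan</a></td>
<td class=""><a href="/ci/content/player/28235.html">S Dhawan</a></td>
</tr>
</table>
</div>
<div id="other"><a href="/ci/content/player/1.html">Not T20</a></div>
"""


class PlayerLinkCollector(HTMLParser):
    """Collect (name, absolute URL) pairs for anchors inside one div."""

    def __init__(self, div_id):
        super().__init__()
        self.div_id = div_id
        self.depth = 0            # 0 means we are outside the target div
        self.current_href = None  # href of the <a> currently being read
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0:
                self.depth += 1   # nested div inside the target div
            elif attrs.get("id") == self.div_id:
                self.depth = 1    # entering the target div
        elif tag == "a" and self.depth > 0 and attrs.get("href"):
            self.current_href = attrs["href"]
            self.text_parts = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
        elif tag == "a" and self.current_href is not None:
            name = "".join(self.text_parts).strip()
            self.links.append((name, urljoin(BASE_URL, self.current_href)))
            self.current_href = None


def collect_player_links(html, div_id="rectPlyr_Playerlistt20"):
    parser = PlayerLinkCollector(div_id)
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    for name, url in collect_player_links(SAMPLE_HTML):
        print(name, url)
```

Because only <div> and <a> tags are tracked, the unbalanced <tr> tags never matter. The trade-off is that this sketch knows nothing about table columns, so it suits the "all links" case rather than picking out one column.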