Web scraping with BeautifulSoup 4

Time: 2016-10-15 16:04:13

Tags: python-2.7 web-scraping beautifulsoup

Below is a div tag taken directly from espncricinfo.com.

<div id="rectPlyr_Playerlistt20" style="display: none; visibility: hidden;
     background:url(http://i.imgci.com/espncricinfo/ciPlayerTablebottom-bg.gif) bottom left no-repeat;">
  <table class="playersTable" cellpadding="0" cellspacing="0" style="margin-top:15px; margin-bottom:14px;">
        <td class="divider"><a href="/ci/content/player/26421.html">R Ashwin</a></td>
        <td class="divider"><a href="/ci/content/player/27223.html">STR Binny</a></td>
        <td class=""><a href="/ci/content/player/625383.html">JJ Bumrah</a></td> 
    </tr>
    <tr class="odd">
      <td class="divider"><a href="/ci/content/player/430246.html">YS Chahal</a></td>
      <td class="divider"><a href="/ci/content/player/290727.html">R Dhawan</a></td>
      <td class=""><a href="/ci/content/player/28235.html">S Dhawan</a></td> 
    </tr>
    <tr class="">
      <td class="divider"><a href="/ci/content/player/28081.html">MS Dhoni</a></td>
      <td class="divider"><a href="/ci/content/player/28671.html">FY Fazal</a></td>
      <td class=""><a href="/ci/content/player/28763.html">G Gambhir</a></td> 
    </tr>
    <tr class="odd">
      <td class="divider"><a href="/ci/content/player/234675.html">RA Jadeja</a></td>
      <td class="divider"><a href="/ci/content/player/290716.html">KM Jadhav</a></td>
      <td class=""><a href="/ci/content/player/253802.html">V Kohli</a></td> 
    </tr>
    <tr class="">
      <td class="divider"><a href="/ci/content/player/277955.html">DS Kulkarni</a></td>
      <td class="divider"><a href="/ci/content/player/326016.html">B Kumar</a></td>
      <td class=""><a href="/ci/content/player/398506.html">Mandeep Singh</a></td> 
    </tr>
    <tr class="odd">
      <td class="divider"><a href="/ci/content/player/31107.html">A Mishra</a></td>
      <td class="divider"><a href="/ci/content/player/481896.html">Mohammed Shami</a></td>
      <td class=""><a href="/ci/content/player/290630.html">MK Pandey</a></td> 
    </tr>
    <tr class="">
      <td class="divider"><a href="/ci/content/player/554691.html">AR Patel</a></td>
      <td class="divider"><a href="/ci/content/player/32540.html">CA Pujara</a></td>
      <td class=""><a href="/ci/content/player/277916.html">AM Rahane</a></td> 
    </tr>
    <tr class="odd">
      <td class="divider"><a href="/ci/content/player/422108.html">KL Rahul</a></td>
      <td class="divider"><a href="/ci/content/player/33141.html">AT Rayudu</a></td>
      <td class=""><a href="/ci/content/player/279810.html">WP Saha</a></td> 
    </tr>
    <tr class="">
      <td class="divider"><a href="/ci/content/player/236779.html">I Sharma</a></td>
      <td class="divider"><a href="/ci/content/player/34102.html">RG Sharma</a></td>
      <td class=""><a href="/ci/content/player/537126.html">BB Sran</a></td> 
    </tr>
    <tr class="odd">
      <td class="divider"><a href="/ci/content/player/390484.html">JD Unadkat</a></td>
      <td class="divider"><a href="/ci/content/player/237095.html">M Vijay</a></td>
      <td class=""><a href="/ci/content/player/376116.html">UT Yadav</a></td> 
    </tr>
    <tr class="">
    </tr>
  </table>
</div>

I want to scrape the HTML above:

from bs4 import BeautifulSoup
import os
import urllib2
BASE_URL = "http://www.espncricinfo.com"
espn_ = urllib2.urlopen("http://www.espncricinfo.com/ci/content/player/index.html?country=6")

soup = BeautifulSoup(espn_ , 'html.parser')

#print soup.prettify().encode('utf-8')
t20 = soup.find_all('div' , {"id" : "rectPlyr_Playerlistt20"})
for row in t20:
 print(row.find('tr' , {"class":"odd"}))

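Assume the code above is run against the URL given. When I scrape, the output I get is None.

Even when I print t20 I do not get the full output: it only shows content up to JJ Bumrah, i.e. only the first <tr> tag. If the data above is unclear, go to the URL assigned to espn_, select the India team, then select the T20 tab. I want to scrape the href links of all the players shown under the T20 tab.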

1 answer:

Answer 0 (score: 1)

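The HTML is badly broken, which you can see just by looking at the first few lines of the table: the first row has no opening <tr>. Your best bet is to use lxml or html5lib as the parser, select the anchors directly, and slice the result with a step: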

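# espn_ is the urllib2 response from the question and is passed directly,
# in place of the html.parser call. If you fetch the page with requests
# instead, pass espn_.content here.
soup = BeautifulSoup(espn_, 'html5lib')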

t20 = soup.select("#rectPlyr_Playerlistt20 .playersTable td.divider a")
for a in t20[1::2]:
   print(a)

This gives you:

<a href="/ci/content/player/27223.html">STR Binny</a>
<a href="/ci/content/player/290727.html">R Dhawan</a>
<a href="/ci/content/player/28671.html">FY Fazal</a>
<a href="/ci/content/player/290716.html">KM Jadhav</a>
<a href="/ci/content/player/326016.html">B Kumar</a>
<a href="/ci/content/player/481896.html">Mohammed Shami</a>
<a href="/ci/content/player/32540.html">CA Pujara</a>
<a href="/ci/content/player/33141.html">AT Rayudu</a>
<a href="/ci/content/player/34102.html">RG Sharma</a>
<a href="/ci/content/player/237095.html">M Vijay</a>
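
If the goal is every player link inside the T20 div rather than a slice of the td.divider anchors, one possible alternative is to skip the table structure entirely and collect anchors as a stream of tags, which does not depend on the rows being well formed. The following is only a sketch under stated assumptions: it uses Python 3 and the standard library's html.parser instead of BeautifulSoup (the question itself targets Python 2.7), PlayerLinkParser is a hypothetical helper written for this illustration, and the sample string is a shortened copy of the markup from the question, not the live page. It also assumes the <div> tags themselves are balanced, which has not been checked against the real site.

```python
# Python 3, standard library only (an assumption; the question uses Python 2.7).
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://www.espncricinfo.com"  # same base URL as in the question


class PlayerLinkParser(HTMLParser):
    """Hypothetical helper: collect (name, absolute URL) pairs for player
    anchors found inside the div with the given id."""

    def __init__(self, div_id):
        super().__init__()
        self.div_id = div_id
        self.depth = 0      # >0 while inside the target div (counts nested divs)
        self.href = None    # href of the player anchor currently open, if any
        self.text = []      # text chunks seen inside that anchor
        self.players = []   # collected (name, url) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Enter the target div, or track a div nested inside it.
            if self.depth or attrs.get("id") == self.div_id:
                self.depth += 1
        elif tag == "a" and self.depth:
            href = attrs.get("href") or ""
            if "/content/player/" in href:
                self.href = href
                self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
        elif tag == "a" and self.href is not None:
            name = "".join(self.text).strip()
            self.players.append((name, urljoin(BASE_URL, self.href)))
            self.href = None


# Shortened copy of the broken markup: the first row has no opening <tr>.
sample = """
<div id="rectPlyr_Playerlistt20" style="display: none;">
  <table class="playersTable">
        <td class="divider"><a href="/ci/content/player/26421.html">R Ashwin</a></td>
        <td class=""><a href="/ci/content/player/625383.html">JJ Bumrah</a></td>
    </tr>
    <tr class="odd">
      <td class="divider"><a href="/ci/content/player/430246.html">YS Chahal</a></td>
    </tr>
  </table>
</div>
<div id="other"><a href="/ci/content/player/1.html">Not T20</a></div>
"""

parser = PlayerLinkParser("rectPlyr_Playerlistt20")
parser.feed(sample)
for name, url in parser.players:
    print(name, url)
```

Because the parser only reacts to <div> and <a> tags, the missing <tr> has no effect on what is collected, and the anchor in the second div is skipped. With BeautifulSoup and html5lib, selecting "#rectPlyr_Playerlistt20 a" without the td.divider filter and without the slice might serve the same purpose, though that has not been verified against the live page.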