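How to scrape data from a table with a loop using Python

Time: 2016-01-16 20:57:07

Tags: python html web-scraping

I'm trying to get some data from a website and I'm having a hard time with it. I can get the players' names, but that's about it. I've been trying different things. Below is a sample of the markup I'm trying to get through. Note that there are two tables (one for each team), and the class on each player's row alternates between "odd" and "even". The sample HTML comes first, followed by my Python script. I marked the parts I want. I'm using Python 2.7.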

`<table id="nbaGITeamStats" cellpadding="0" cellspacing="0">
      <thead class="nbaGIClippers">
         <tr>
            <th colspan="17">Los Angeles Clippers (1-0)</th> <!-- I want team name  -->
         </tr>
      </thead>
      <tbody><tr colspan="17">
         <td colspan="17" class="nbaGIBoxCat"><span>field goals</span><span>rebounds</span></td>
      </tr>
      <tr>
     <td class="nbaGITeamHdrStatsNoBord" colspan="1">&nbsp;</td>
     <td class="nbaGITeamHdrStats">pos</td>
     <td class="nbaGITeamHdrStats">min</td>
     <td class="nbaGITeamHdrStats">fgm-a</td>
     <td class="nbaGITeamHdrStats">3pm-a</td>
     <td class="nbaGITeamHdrStats">ftm-a</td>
     <td class="nbaGITeamHdrStats">+/-</td>
     <td class="nbaGITeamHdrStats">off</td>
     <td class="nbaGITeamHdrStats">def</td>
     <td class="nbaGITeamHdrStats">tot</td>
     <td class="nbaGITeamHdrStats">ast</td>
     <td class="nbaGITeamHdrStats">pf</td>
     <td class="nbaGITeamHdrStats">st</td>
     <td class="nbaGITeamHdrStats">to</td>
     <td class="nbaGITeamHdrStats">bs</td>
     <td class="nbaGITeamHdrStats">ba</td>
     <td class="nbaGITeamHdrStats">pts</td>
  </tr>
  <tr class="odd">
     <td id="nbaGIBoxNme" class="b"><a href="/playerfile/paul_pierce/index.html">P. Pierce</a></td> <!-- I want player name  -->
     <td class="nbaGIPosition">F</td> <!-- I want position name  -->
     <td>14:16</td> <!-- I want this  -->
     <td>1-4</td>  <!-- I want this  -->
     <td>1-2</td>  <!-- I want this  -->
     <td>2-2</td>  <!-- I want this  -->
     <td>+12</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>0</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>3</td>  <!-- I want this  -->
     <td>2</td>  <!-- I want this  -->
     <td>0</td>  <!-- I want this  -->
     <td>0</td>  <!-- I want this  -->
     <td>0</td>  <!-- I want this  -->
     <td>5</td>  <!-- I want this  -->
  </tr>

  <tr class="even">
     <td id="nbaGIBoxNme" class="b"><a href="/playerfile/blake_griffin/index.html">B. Griffin</a></td>  <!-- I want this  -->
     <td class="nbaGIPosition">F</td>  <!-- I want this  -->
     <td>26:19</td>  <!-- I want this  -->
     <td>5-14</td>  <!-- I want this  -->
     <td>0-1</td>  <!-- I want this  -->
     <td>1-1</td>  <!-- I want this  -->
     <td>+14</td>  <!-- I want this  -->
     <td>0</td>  <!-- I want this  -->
     <td>5</td>  <!-- I want this  -->
     <td>5</td>  <!-- I want this  -->
     <td>2</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>11</td>  <!-- I want this  -->
  </tr>
  <tr class="odd">
     <td id="nbaGIBoxNme" class="b"><a href="/playerfile/deandre_jordan/index.html">D. Jordan</a></td>  <!-- I want this  -->
     <td class="nbaGIPosition">C</td>  <!-- I want this  -->
     <td>26:27</td>  <!-- I want this  -->
     <td>6-7</td>  <!-- I want this  -->
     <td>0-0</td>  <!-- I want this  -->
     <td>3-5</td>  <!-- I want this  -->
     <td>+19</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>11</td>  <!-- I want this  -->
     <td>12</td>  <!-- I want this  -->
     <td>0</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>0</td>  <!-- I want this  -->
     <td>2</td>  <!-- I want this  -->
     <td>3</td>  <!-- I want this  -->
     <td>0</td>  <!-- I want this  -->
     <td>15</td>  <!-- I want this  -->
  </tr>
   <!-- And so on it will keep changing class from odd to even, even to odd  -->
    <!-- Also note there are two tables, one for each team  -->
   <!-- this is the table id >>> <table id="nbaGITeamStats" cellpadding="0" cellspacing="0"> -->`

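That was long, but I wanted to give an example of the classes switching. Here is my Python script. Once I actually manage to scrape the data, I plan to use a dictionary to hold it.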

import urllib
import urllib2
from bs4 import BeautifulSoup
import re
gamesForDay = ['/games/20151002/DENLAC/gameinfo.html']
for game in gamesForDay:
   url =  "http://www.nba.com/"+game
   page = urllib2.urlopen(url).read()
   soup = BeautifulSoup(page)
   for tr in soup.find_all('table id="nbaGITeamStats'):
    tds = tr.find_all('td')
    print tds

3 Answers:

Answer 0: (score: 2)

The correct way to write it is:

for tr in soup.find_all('table', id='nbaGITeamStats'):

This works fine for me (Python 3.4):

>>> import requests
>>> from bs4 import BeautifulSoup
>>> gamesForDay = ['/games/20151002/DENLAC/gameinfo.html']
>>> 
>>> for game in gamesForDay:
...    url =  "http://www.nba.com/"+game
...    page = requests.get(url).content
...    soup = BeautifulSoup(page, 'html.parser')
...    for tr in soup.find_all('table', id='nbaGITeamStats'):
...        tds = tr.find_all('td')
...        print(tds)

To access the content inside the td tags, use .text, like this:

for td in tds:
   print(td.text)
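
As a rough offline illustration of the two points above, the sketch below parses a trimmed-down copy of the question's markup instead of fetching the live page (the `html` string is a stand-in, not the real response). It shows that the first positional argument of `find_all` is treated as a tag name, and that `.text` returns the text of a cell including nested tags.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the page in the question (not the live site).
html = """
<table id="nbaGITeamStats">
  <thead><tr><th colspan="17">Los Angeles Clippers (1-0)</th></tr></thead>
  <tbody>
    <tr class="odd">
      <td id="nbaGIBoxNme" class="b"><a href="/playerfile/paul_pierce/index.html">P. Pierce</a></td>
      <td class="nbaGIPosition">F</td>
      <td>14:16</td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# The first positional argument is a tag name, so this looks for a tag
# literally named 'table id="nbaGITeamStats' and matches nothing.
wrong = soup.find_all('table id="nbaGITeamStats')

# Tag name and attribute filter passed separately.
right = soup.find_all("table", id="nbaGITeamStats")

# .text drops the markup (including the nested <a>) and keeps the text.
cells = [td.text for td in right[0].find_all("td")]
print(len(wrong), len(right), cells)
```

The original call does not raise an error; it simply returns an empty list, which is why the loop in the question printed nothing.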

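Answer 1: (score: 2)

Here is my solution. Note that I have a slightly different version of BeautifulSoup, not the one from bs4, but the logic shouldn't be too far off. Still on Python 2.7 (on Windows in my case).

You may need to handle a few nuances where the player sections differ from what is shown above, but I think you'll be able to deal with that part :-)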

import urllib
import urllib2
# from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup
import re
gamesForDay = ['/games/20151002/DENLAC/gameinfo.html']
for game in gamesForDay:
   url =  "http://www.nba.com/"+game
   page = urllib2.urlopen(url).read()
   soup = BeautifulSoup(page)

   # fetch the tables you are interested in
   tables = soup.findAll(id="nbaGITeamStats")
   for table in tables:
       team_name = table.thead.tr.th.text
       # odd/even class rows (tr)
       rows = [ x for x in table.findAll('tr') if x.get('class',None) in ['odd','even'] ]
       for player in rows:
           # search the row cols based on 'id'
           player_name = player.find('td', attrs={'id':'nbaGIBoxNme'}).text

           # search the row cols based on 'class'
           player_position = player.find('td', attrs={'class':'nbaGIPosition'}).text

           # search for all td where the class is not defined
           player_numbers = [ x.text for x in player.findAll('td', attrs={'class':None})]

           print player_name, player_position, player_numbers

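With bs4 (which I've learned is BeautifulSoup 4), some modifications were necessary. You still need to handle a few things, but this extracts most of the data you want: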

import urllib
import urllib2
from bs4 import BeautifulSoup
import re
gamesForDay = ['/games/20151002/DENLAC/gameinfo.html']
for game in gamesForDay:
   url =  "http://www.nba.com/"+game
   page = urllib2.urlopen(url).read()
   soup = BeautifulSoup(page, "html.parser")

   # fetch the tables you are interested in
   tables = soup.findAll(id="nbaGITeamStats")
   for table in tables:
       team_name = table.thead.tr.th.text
       # odd/even class rows (tr)
       rows = table.find_all(attrs={'class':'odd'})
       rows.extend(table.find_all(attrs={'class':'even'}))

       for player in rows:
           # search the row cols based on 'id'
           player_name = player.find('td', attrs={'id':'nbaGIBoxNme'}).text

           # search the row cols based on 'class'
           player_position = player.find('td', attrs={'class':'nbaGIPosition'}).text

           # search for all td where the class is not defined
           player_numbers = [ x.text for x in player.findAll('td', attrs={'class':None})]

           print player_name, player_position, player_numbers
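
One thing the bs4 version above changes is row order: all "odd" rows come first, then all "even" rows. A possible alternative, sketched here against a small stand-in table rather than the live page, is to pass a list of class values to `find_all`, which matches either class and returns rows in document order. The `players` list and the trimmed `html` string are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Small stand-in table reusing the markup from the question; nothing is fetched.
html = """
<table id="nbaGITeamStats">
  <thead><tr><th colspan="17">Los Angeles Clippers (1-0)</th></tr></thead>
  <tbody>
    <tr><td class="nbaGITeamHdrStats">pos</td><td class="nbaGITeamHdrStats">min</td></tr>
    <tr class="odd"><td id="nbaGIBoxNme" class="b"><a href="#">P. Pierce</a></td><td class="nbaGIPosition">F</td><td>14:16</td></tr>
    <tr class="even"><td id="nbaGIBoxNme" class="b"><a href="#">B. Griffin</a></td><td class="nbaGIPosition">F</td><td>26:19</td></tr>
    <tr class="odd"><td id="nbaGIBoxNme" class="b"><a href="#">D. Jordan</a></td><td class="nbaGIPosition">C</td><td>26:27</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
players = []
for table in soup.find_all("table", id="nbaGITeamStats"):
    # A list of class values matches rows carrying either class,
    # and results come back in document order.
    for row in table.find_all("tr", class_=["odd", "even"]):
        name = row.find("td", id="nbaGIBoxNme").get_text(strip=True)
        # Stat cells are the ones without any class attribute.
        stats = [td.get_text(strip=True)
                 for td in row.find_all("td") if not td.has_attr("class")]
        players.append((name, stats))

for name, stats in players:
    print(name, stats)
```

bs4 returns `class` as a list (for example `['odd']`) rather than a string, which is why the `x.get('class', None) in ['odd', 'even']` test from the BeautifulSoup 3 version needs changing.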

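Answer 2: (score: 1)

So here is everything I ended up doing. Of course, I still have to clean the code up from here. I got a lot of help from sal on this.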

import urllib2
from bs4 import BeautifulSoup
import re
gamesForDay = ['/games/20151002/DENLAC/gameinfo.html']
for game in gamesForDay:
   url =  "http://www.nba.com/"+game
   page = urllib2.urlopen(url).read()
   soup = BeautifulSoup(page, "html.parser")

   # fetch the tables you are interested in
   tables = soup.findAll(id="nbaGITeamStats")
   for table in tables:
        team_name = table.thead.tr.th.text
        # odd/even class rows (tr)
        rowsodd = table.find_all(attrs={'class':'odd'})
        rowseven =table.find_all(attrs={'class':'even'})

        for player in rowsodd:
            # search the row cols based on 'id'
            player_name = player.find('td', attrs={'id':'nbaGIBoxNme'}).text

            # search the row cols based on 'class'
            #player_position = player.find('td', attrs={'class':'nbaGIPosition'}).text
            #^THERE ARE ONLY POSITIONS PUT ON PLAYERS AFTER THEY ARE PUT IN THE GAME.
            # search for all td where the class is not defined
            player_numbers = [ x.text for x in player.findAll('td', attrs={'class':None})]

            print player_name, player_numbers
        for player in rowseven:
            # search the row cols based on 'id'
            player_name = player.find('td', attrs={'id':'nbaGIBoxNme'}).text

            # search the row cols based on 'class'
            #player_position = player.find('td', attrs={'class':'nbaGIPosition'}).text
             #^THERE ARE ONLY POSITIONS PUT ON PLAYERS AFTER THEY ARE PUT IN THE GAME.
            # search for all td where the class is not defined
            player_numbers = [ x.text for x in player.findAll('td', attrs={'class':None})]
            print player_name, player_numbers

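Now everything shows up. I'll have to clean it up some more, but the data is cleaner. As you can tell from the question, I had never actually used Beautiful Soup before. It takes two loops, or maybe someone knows a better way, but this was the easiest way for me to get the data. I'm always looking to improve. I hope someone learns something from this.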
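
Since the question mentions storing the results in a dictionary, here is one possible shape for that step. It is only a sketch: it starts from cell text that has already been scraped (the `scraped` structure and the `stat_line` helper are hypothetical), with column labels and values copied from the sample HTML in the question.

```python
# Column labels copied from the header row of the sample table in the question.
HEADERS = ["pos", "min", "fgm-a", "3pm-a", "ftm-a", "+/-", "off", "def",
           "tot", "ast", "pf", "st", "to", "bs", "ba", "pts"]


def stat_line(position, numbers):
    """Pair each column label with its cell text (hypothetical helper).

    `position` may be None for a player whose position cell is missing.
    """
    values = [position] + list(numbers)
    return dict(zip(HEADERS, values))


# Already-scraped stand-in data: (name, position, stat cells) per team.
scraped = {
    "Los Angeles Clippers (1-0)": [
        ("P. Pierce", "F", ["14:16", "1-4", "1-2", "2-2", "+12",
                            "1", "0", "1", "1", "3", "2", "0", "0", "0", "5"]),
        ("B. Griffin", "F", ["26:19", "5-14", "0-1", "1-1", "+14",
                             "0", "5", "5", "2", "1", "1", "1", "1", "1", "11"]),
    ],
}

# Nested dictionary: team -> player -> {column label: cell text}.
box_score = {
    team: {name: stat_line(pos, nums) for name, pos, nums in rows}
    for team, rows in scraped.items()
}

print(box_score["Los Angeles Clippers (1-0)"]["P. Pierce"]["pts"])
```

Values are kept as strings because several columns ("min", "fgm-a", "+/-") are not plain integers; converting them is left as a separate step.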