Specifying which table on a page to scrape

Time: 2018-05-27 21:20:51

Tags: python web-scraping beautifulsoup

The page in question is: http://stats.nba.com/player/2544/shots-dash/?Season=2017-18&SeasonType=Playoffs&LastNGames=6&sort=FGM&dir=1

I am trying to scrape the fifth table.

I first tried to get the column headers by doing the following:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://stats.nba.com/player/2544/shots-dash/?Season=2017-18&SeasonType=Playoffs&LastNGames=6'
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")

column_headers = [th.getText() for th in soup.findAll('tr')[1].findAll('th')]

data_rows = soup.findAll('tr')[1:]

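I then got an IndexError: list index out of range error, and my data_rows came back empty.

That in turn made me realize that the tags for every table on this page are identical, so I am not sure how to specify the exact table I want...

1 answer:

Answer 0: (score: 0)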

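The site you linked is built with Angular, a JavaScript framework. The HTML that is loaded before Angular populates the page is therefore only the skeleton of the site: basic HTML without the data you need.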
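To make the failure in the question concrete, here is a minimal sketch that uses only the standard library. The `SKELETON` string is a made-up stand-in for what the server sends before Angular runs (the real markup differs), and `RowCounter` is a hypothetical helper name, not part of any library.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the pre-rendered page:
# an Angular placeholder and a script tag, but no table rows.
SKELETON = """
<html>
  <body>
    <div ng-view></div>
    <script src="app.js"></script>
  </body>
</html>
"""


class RowCounter(HTMLParser):
    """Records every <tr> start tag encountered while parsing."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append(tag)


counter = RowCounter()
counter.feed(SKELETON)

print(len(counter.rows))  # the skeleton contains no <tr> elements

try:
    counter.rows[1]
except IndexError as exc:
    # Indexing into the empty list fails the same way as in the question
    print("IndexError:", exc)
```

Any parser that sees only this skeleton, BeautifulSoup included, has no rows to return, which is why the list index lookup fails.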

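So, to answer your question, you have to understand the logic behind your target, which is an application built with Angular. With a little research you will find that an Angular application is basically an API consumer.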

By inspecting the requests your browser makes, you will find the API endpoint that Angular uses to populate the tables on the page you linked.
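Once the endpoint is known, its long query string is easier to read and change when it is assembled from a dict. The sketch below rebuilds the endpoint URL used in this answer with `urllib.parse.urlencode`; `BASE_URL`, `params`, and `build_url` are illustrative names, and whether the server accepts other parameter combinations is not something this sketch verifies.

```python
from urllib.parse import urlencode

# Endpoint path as seen in the browser's network requests
BASE_URL = 'http://stats.nba.com/stats/playerdashptshots'

# Query parameters, in the order they appear in the observed request.
# Empty strings are kept so that the keys still show up in the URL.
params = {
    'DateFrom': '',
    'DateTo': '',
    'GameSegment': '',
    'LastNGames': '6',
    'LeagueID': '00',
    'Location': '',
    'Month': '0',
    'OpponentTeamID': '0',
    'Outcome': '',
    'PerMode': 'PerGame',
    'Period': '0',
    'PlayerID': '2544',
    'Season': '2017-18',
    'SeasonSegment': '',
    'SeasonType': 'Playoffs',
    'TeamID': '0',
    'VsConference': '',
    'VsDivision': '',
}


def build_url(base, query):
    """Join the base URL and the URL-encoded query string."""
    return '{0}?{1}'.format(base, urlencode(query))


url = build_url(BASE_URL, params)
print(url)
```

Changing `PlayerID` or `LastNGames` in the dict is then a one-line edit instead of a search through the URL.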

Here is an example of how to query the API endpoint and get the data you want:

import json
from urllib.request import urlopen, Request 


# API endpoint from which Angular fetches the data it fills the tables with
url = 'http://stats.nba.com/stats/playerdashptshots?DateFrom=&DateTo=&GameSegment=&LastNGames=6&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PerMode=PerGame&Period=0&PlayerID=2544&Season=2017-18&SeasonSegment=&SeasonType=Playoffs&TeamID=0&VsConference=&VsDivision='

# The server behind the website checks the request headers,
# so you need to send at least
# a valid Cookie and a User-Agent
headers = {
    'Cookie': 'ak_bmsc=98B4FD680504382D1BE219CE963DE520ADDE6D86FF260000CD220B5B87B91E5E~ploZrwUVSrpu3HO/7DratALkZS/cK+SOZ9zMvNJNvJ6u/dYH50zISBdr3kK2S6ifBH/zXh9Z8oBFFeq1so2FGYfl29Zob9z065l/0caXBy5CNT3gOCn3OojgRPe7j1LLDThGl7eYQju8bl+1dO24vr5r9U+YngrmtlXpUPX+IT6Z7YoJPXP9YHmx1FMCyr7FOKmTyJL7js91F1pGVKGEOE/plhHEB4P7sq3B0uRzWWWcc=; s_cc=true; s_fid=5CA44E6CD67BA096-0F573C9E17BE0B09; s_sq=%5B%5BB%5D%5D; bm_sv=682C33C8686155B97E1B1692275AF96F~HweJkyeagLOu7iHDyl4xgtUAYOpT0NW49tH2OZpG93uH9+RTvrfGREItatT/72/WL3cY/k2VeYr/tDO1feFxAvO+Xe8fzIw2JOH4A/0lRXqp709dcJb53l9AytLTOgoHaQ4UG7rjPBPyMSoFFeJ3tg==',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36',
}
# Building the request
request = Request(url, headers=headers)
with urlopen(request, timeout=3) as f:
    data = f.read().decode('utf8')

# Parse the JSON text into a Python dict
data_json = json.loads(data)
# Get the result set at index 5 (indices are zero-based, so this is the
# sixth entry in 'resultSets'; adjust it if it is not the table you want)
desired_table = data_json['resultSets'][5]
# Get the column headers of that table
data_headers = desired_table.get('headers')

# Print them out ...
# PS: you still need to work out how to pretty-print them
# the way the website does,
# for example by converting some floats to percentages.
for elm in desired_table.get('rowSet'):
    for head, val in zip(data_headers, elm):
        print('{0} : {1}'.format(head, val))
    print('#'*20)

Output (first row of the 5th table):

PLAYER_ID : 2544
PLAYER_NAME_LAST_FIRST : James, LeBron
SORT_ORDER : 1
GP : 6
G : 1
CLOSE_DEF_DIST_RANGE : 0-2 Feet - Very Tight
FGA_FREQUENCY : 0.007
FGM : 0.0
FGA : 0.17
FG_PCT : 0.0
EFG_PCT : 0.0
FG2A_FREQUENCY : 0.007
FG2M : 0.0
FG2A : 0.17
FG2_PCT : 0.0
FG3A_FREQUENCY : 0.0
FG3M : 0.0
FG3A : 0.0
FG3_PCT : None
####################
...
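Because the question was about picking one specific table, it may help to select the result set by name instead of by position and to turn its rows into dicts. This is a sketch under assumptions: it assumes each entry of `resultSets` carries a `name` key next to `headers` and `rowSet`, and the sample payload below is trimmed and hand-written (the values come from the output above, but the pairing of name and values is illustrative only). `find_table` and `table_to_records` are hypothetical helper names.

```python
# Hand-written sample shaped like data_json['resultSets'].
# Assumption: every result set has 'name', 'headers' and 'rowSet' keys.
sample_result_sets = [
    {'name': 'Overall', 'headers': ['SHOT_TYPE', 'GP'], 'rowSet': []},
    {
        'name': 'ClosestDefenderShooting',  # illustrative pairing only
        'headers': ['PLAYER_ID', 'CLOSE_DEF_DIST_RANGE', 'FGA_FREQUENCY', 'FG3_PCT'],
        'rowSet': [[2544, '0-2 Feet - Very Tight', 0.007, None]],
    },
]


def find_table(result_sets, name):
    """Return the first result set whose 'name' matches, or None."""
    for table in result_sets:
        if table.get('name') == name:
            return table
    return None


def table_to_records(table):
    """Pair each row with the headers, giving one dict per row."""
    headers = table.get('headers', [])
    return [dict(zip(headers, row)) for row in table.get('rowSet', [])]


table = find_table(sample_result_sets, 'ClosestDefenderShooting')
records = table_to_records(table)
print(records[0]['CLOSE_DEF_DIST_RANGE'])
```

A list of dicts like `records` can also be passed directly to `pandas.DataFrame`, which the question already imports.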

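To see how these values are displayed, note that the website renders them through a template with tags [visit this link].

The interesting part is: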

  <tbody>
    <tr data-ng-repeat="(i, row) in page" index="{{ ::i }}">
      <td class="player">
        <span ng-if="title==='Overall'">{{ ::row.SHOT_TYPE }}</span>
        <span ng-if="title==='GeneralShooting'">{{ ::row.SHOT_TYPE }}</span>
        <span ng-if="title==='ShotClockShooting'">{{ ::row.SHOT_CLOCK_RANGE }}</span>
        <span ng-if="title==='ClosestDefenderShooting'">{{ ::row.CLOSE_DEF_DIST_RANGE }}</span>
        <span ng-if="title==='ClosestDefender10ftPlusShooting'">{{ ::row.CLOSE_DEF_DIST_RANGE }}</span>
        <span ng-if="title==='DribbleShooting'">{{ ::row.DRIBBLE_RANGE }}</span>
        <span ng-if="title==='TouchTimeShooting'">{{ ::row.TOUCH_TIME_RANGE }}</span>
      </td>
      <td>{{ ::row.GP }}</td>
      <td>{{ ::row.G }}</td>
      <td>{{ ::row.FGA_FREQUENCY | percent }}%</td>
      <td>{{ ::row.FGM | permode:params.PerMode }}</td>
      <td>{{ ::row.FGA | permode:params.PerMode }}</td>
      <td>{{ ::row.FG_PCT | percent }}</td>
      <td>{{ ::row.EFG_PCT | percent }}</td>
      <td>{{ ::row.FG2A_FREQUENCY | percent }}%</td>
      <td>{{ ::row.FG2M | permode:params.PerMode }}</td>
      <td>{{ ::row.FG2A | permode:params.PerMode }}</td>
      <td>{{ ::row.FG2_PCT | percent }}</td>
      <td>{{ ::row.FG3A_FREQUENCY | percent }}%</td>
      <td>{{ ::row.FG3M | permode:params.PerMode }}</td>
      <td>{{ ::row.FG3A | permode:params.PerMode }}</td>
      <td>{{ ::row.FG3_PCT | percent }}</td>
    </tr>
  </tbody> 
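The template suggests two display rules: the `*_FREQUENCY` columns go through a `percent` filter and get a literal `%` appended, while the `*_PCT` columns go through the same filter without the sign. The sketch below imitates that in Python. The site's actual `percent` filter is not shown here, so the rounding to one decimal and the `'-'` placeholder for missing values are assumptions, and `format_percent` and `format_row` are hypothetical names.

```python
def format_percent(value, digits=1):
    """Scale a fraction to a percentage string without the % sign.

    Assumption: one decimal place, and '-' for missing values.
    """
    if value is None:
        return '-'
    return '{0:.{1}f}'.format(value * 100, digits)


def format_row(row):
    """Apply the template's apparent display rules to one row dict."""
    formatted = {}
    for key, value in row.items():
        if key.endswith('_FREQUENCY'):
            text = format_percent(value)
            # The template appends a literal % after the filter output
            formatted[key] = text + '%' if value is not None else text
        elif key.endswith('_PCT'):
            formatted[key] = format_percent(value)
        else:
            formatted[key] = value
    return formatted


row = {'GP': 6, 'FGA_FREQUENCY': 0.007, 'FG_PCT': 0.0, 'FG3_PCT': None}
print(format_row(row))
```

The `permode` filter is left out because its behavior depends on `params.PerMode` and cannot be inferred from the template alone.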

Finally, be nice to the website: do not flood it or spam it with requests.
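One simple way to stay polite is to pause between requests. The sketch below is only an illustration: `fetch_politely` is a hypothetical helper, the delay value is arbitrary rather than a limit published by the site, and `fake_fetch` stands in for a real function that would call `urlopen`.

```python
import time


def fetch_politely(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, sleeping `delay` seconds between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first
            time.sleep(delay)
        results.append(fetch(url))
    return results


def fake_fetch(url):
    """Stand-in fetcher so the sketch runs without network access."""
    return 'payload for ' + url


pages = fetch_politely(['a', 'b'], fake_fetch, delay=0.1)
print(pages)
```

Passing the fetch function in as an argument keeps the throttling logic separate from the request code, so the same loop works with whatever headers and timeout the real request needs.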

Edit: I found this API schema on GitHub; it explains the data returned by the API endpoint. You can read it to understand how to handle the JSON output.