Question

I'm trying to scrape some information from the URL "http://baloncestoenvivo.feb.es/Game/1881578". I want all the data from a table that sits inside a div with id="keyfacts-playbyplay-content-scroll".

I use the following code to access this table:

table = page_soup.find(id="keyfacts-playbyplay-content-scroll").findAll("table", {"class" : "twelve even"})

Then I print "table" to see what I got, and I get a single tr with no data. However, the Firefox or Chrome console shows 799 table rows containing data!

This is what I get when I print "table" in the Python console:

>> table
<table class="twelve even">
<thead>
<tr>
<th colspan="2">Tiempo</th>
<th colspan="2">Cuarto</th>
<th colspan="2">Puntuación</th>
<th colspan="8">Acción</th>
</tr>
</thead>
<tbody>
<!-- ko foreach: LINES -->
<tr>
<td class="text-center" colspan="2" data-bind="text : time"></td>
<td class="text-center" colspan="2" data-bind="text : quarter"></td>
<td colspan="2" data-bind="text : scoreA()==null ? '' : scoreA()+'-'+scoreB()" style="color:#FB0127; text-align: center"></td>
<td colspan="8" data-bind="text : text"></td>
</tr>
<!-- /ko -->
</tbody>
</table>

And this is what we can see in the browser console:

Why can't I get all the tr tags whose td cells contain the data?

What am I doing wrong?

Answer 1

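The table's contents are generated dynamically by JavaScript, which is why they are missing from the page source. The requests module fetches the page source without executing any JavaScript, so you only see the incomplete data.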
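To make the point concrete, here is a minimal sketch that parses a trimmed copy of the row from the question using only the standard library (html.parser is swapped in for BeautifulSoup so the snippet has no dependencies). The class name CellCollector and the constant RAW_ROW are made up for this illustration. It shows that each td carries a data-bind attribute but no text, because the row is only a template that the browser fills in later.

```python
from html.parser import HTMLParser

# Trimmed copy of the markup that requests receives (taken from the question).
RAW_ROW = """
<tbody>
<!-- ko foreach: LINES -->
<tr>
<td class="text-center" colspan="2" data-bind="text : time"></td>
<td colspan="8" data-bind="text : text"></td>
</tr>
<!-- /ko -->
</tbody>
"""


class CellCollector(HTMLParser):
    """Record each td's data-bind attribute and the text inside the cell."""

    def __init__(self):
        super().__init__()
        self.cells = []      # list of [data_bind, text] pairs
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append([dict(attrs).get("data-bind"), ""])

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        # Only text that appears between <td> and </td> is recorded.
        if self._in_td:
            self.cells[-1][1] += data


collector = CellCollector()
collector.feed(RAW_ROW)
for data_bind, text in collector.cells:
    print(repr(data_bind), repr(text))
```

Every cell prints with an empty string as its text, which matches the single empty tr seen in the question.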

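If you open the Network tab in the developer tools and look at the XHR requests, you'll see a request sent to http://baloncestoenvivo.feb.es/api/KeyFacts/1881578, which returns the data as JSON. You can parse this data with the requests module and its built-in .json() function.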

The only catch is that you need to pass the following headers. Without them, the site blocks the script and you'll see a requests.exceptions.ConnectionError.

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
           'Accept': 'application/json, text/javascript, */*; q=0.01'}

r = requests.get('http://baloncestoenvivo.feb.es/api/KeyFacts/1881578', headers=headers)
data = r.json()

You can now get all the table values from the data variable. To see its structure, use the pprint module.
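As a rough sketch of how pprint can help here, the snippet below prints only the top levels of a nested structure. The sample dictionary is a hypothetical stand-in for the parsed JSON: only the key names SCOREBOARD, TEAM, PLAYER, name and pts come from this answer, and the real response almost certainly contains more.

```python
from pprint import pformat

# Hypothetical stand-in for the parsed JSON response.
sample = {
    "SCOREBOARD": {
        "TEAM": [
            {"PLAYER": [{"name": "A. ELONU", "pts": 6}]},
        ]
    }
}

# depth=2 shows two levels of nesting and abbreviates anything deeper.
overview = pformat(sample, depth=2)
print(overview)
```

With the real response you would call pprint(data, depth=2) first, then raise the depth step by step until you find the part you need.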

For example, to get the player names and their points, you can use:

for player in data['SCOREBOARD']['TEAM'][0]['PLAYER']:
    name = player['name']
    points = player['pts']
    print(name, points)

Output:

A. ELONU 6
L. NICHOLLS GONZALEZ 10
S. DOMINGUEZ FERNANDEZ 13
L. QUEVEDO CAÑIZARES 0
M. ASURMENDI VILLAVERDE 5
F. ABDI 0
E. DE SOUZA MACHADO 13
L. GIL COLLADO 0
K. GIVENS 12
D. MOSS 2
A. ROBINSON 0
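The play-by-play rows from the question can probably be rebuilt in a similar way. The page template iterates over something called LINES and binds the fields time, quarter, scoreA, scoreB and text, so the sketch below assumes each entry is a dictionary with those keys. Where that list sits inside the JSON is not shown here, so the hypothetical helper format_lines takes the list directly, and the sample entries are invented for illustration.

```python
def format_lines(lines):
    """Turn play-by-play entries into (time, quarter, score, text) tuples."""
    rows = []
    for line in lines:
        score_a = line.get("scoreA")
        # Mirror the page's binding: the score is blank when scoreA is null.
        score = "" if score_a is None else f"{score_a}-{line.get('scoreB')}"
        rows.append((line.get("time"), line.get("quarter"), score, line.get("text")))
    return rows


# Invented entries; the real values come from the API response.
sample_lines = [
    {"time": "10:00", "quarter": 1, "scoreA": None, "scoreB": None, "text": "Start of quarter"},
    {"time": "09:41", "quarter": 1, "scoreA": 2, "scoreB": 0, "text": "2-point basket"},
]

for row in format_lines(sample_lines):
    print(row)
```

Once you have located the list in the response (pprint helps with that), pass it to the function in place of sample_lines.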

Answer 2

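The reason is that rendering content generated dynamically by JavaScript requires a browser automation tool such as Selenium. If we try to fetch this data with requests alone, we won't get the td cells you're looking for. I'd recommend the official Selenium documentation or YouTube tutorials on the library; it's easy to use once you get the hang of it.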

Selenium Documentation

from bs4 import BeautifulSoup
import requests

# Fetch the raw page source; no JavaScript is executed at this point
html = requests.get('http://baloncestoenvivo.feb.es/Game/1881578').text
soup = BeautifulSoup(html, 'lxml')

# Locate the container that holds the play-by-play table
tabl = soup.find('div', {'id': 'keyfacts-playbyplay-content-scroll'}).find('div', {'class': 'twelve columns'})

print(tabl)

This doesn't work: it only returns a part of the HTML (the table element) that doesn't contain the information you're looking for.
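For comparison, a Selenium-based version might look roughly like the sketch below. It is untested against this site and rests on several assumptions: Selenium 4 is installed, Firefox and its driver are available, and the rendered rows end up under the same div id as in the question. The names fetch_rendered_html and ROW_CELLS are made up for this example.

```python
# CSS selector for the cells of the play-by-play table (id taken from the question).
ROW_CELLS = "#keyfacts-playbyplay-content-scroll table tbody tr td"


def fetch_rendered_html(url, timeout=15):
    """Load url in a real browser and return the HTML after scripts have run."""
    # Imported lazily so this module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # assumption: Firefox and geckodriver are set up
    try:
        driver.get(url)
        # Wait until at least one table cell actually contains text.
        WebDriverWait(driver, timeout).until(
            lambda d: any(
                (td.get_attribute("textContent") or "").strip()
                for td in d.find_elements(By.CSS_SELECTOR, ROW_CELLS)
            )
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()


if __name__ == "__main__":
    html = fetch_rendered_html("http://baloncestoenvivo.feb.es/Game/1881578")
    print(len(html))
```

The returned HTML can then be passed to BeautifulSoup exactly as in the snippet above, and the td cells should be populated.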

BeautifulSoup returns empty td tags
