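Python - Unable to retrieve data from a web page table using Beautiful Soup or lxml XPath

Time: 2017-01-09 19:25:53

Tags: python xpath beautifulsoup

I am trying to retrieve data from the "Advanced Box Score Stats" tables on the following web page: http://www.sports-reference.com/cbb/boxscores/2016-11-11-villanova.html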

I first tried a very broad approach, using BeautifulSoup to retrieve every table on the page.

This retrieved only the "Basic Box Score Stats" tables. It did not retrieve the "Advanced Box Score Stats" tables as I had hoped.

Next, I tried to be more specific by using an lxml XPath expression.

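This returned an empty list.

I have been struggling with this for a while and have tried to work through it using several related posts.

Thanks in advance for any help!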

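2 Answers:

Answer 0: (score: 4)

There is no need for selenium and/or PhantomJS. The "Advanced Box Score Stats" tables are actually in the HTML; they are just inside HTML comments. To parse them: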

import requests
from bs4 import BeautifulSoup, Comment


url = "http://www.sports-reference.com/cbb/boxscores/2016-11-11-villanova.html"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# find the comments containing the desired tables
# (`string=` is the current name of the older `text=` argument)
tables = soup.find_all(string=lambda text: isinstance(text, Comment) and 'Advanced Box Score Stats' in text)

# we have 2 tables - one for an opponent team
for table in tables:
    table_soup = BeautifulSoup(table, "html.parser")
    advanced_table = table_soup.select_one("table[id^=box-score-advanced]")
    for row in advanced_table("tr")[2:]:  # skip headers
        print(row.th.get_text())
    print("-------")

This prints the player names from the first column of each advanced table:

Nick Lindner
Monty Boykins
Matt Klinewski
Paulius Zalys
Auston Evans
Reserves
Myles Cherry
Kyle Stout
Eric Stafford
Lukas Jarrett
Hunter Janacek
Jimmy Panzini
School Totals
-------
Kris Jenkins
Phil Booth
Josh Hart
Jalen Brunson
Darryl Reynolds
Reserves
Donte DiVincenzo
Mikal Bridges
Eric Paschall
Tim Delaney
Dylan Painter
Denny Grace
Tom Leibig
Matt Kennedy
School Totals
-------
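
The same idea extends from the first column to whole rows. The following is a minimal sketch rather than a drop-in solution: it runs on a small inline HTML string instead of the live page, and the `SAMPLE_HTML` content, the `box-score-advanced-demo` id, the placeholder player names and values, and the helper name `commented_table_rows` are all assumptions made up for illustration.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the real page: one live table and one table
# wrapped in an HTML comment. Names and values are placeholders.
SAMPLE_HTML = """
<div id="all_box-score-basic-demo">
  <table id="box-score-basic-demo">
    <tr><th>Starters</th><td>MP</td></tr>
    <tr><th>Player A</th><td>30</td></tr>
  </table>
</div>
<div id="all_box-score-advanced-demo">
<!--
  <table id="box-score-advanced-demo">
    <tr><th>Starters</th><td>TS%</td></tr>
    <tr><th>Player A</th><td>0.5</td></tr>
  </table>
-->
</div>
"""


def commented_table_rows(html, id_prefix):
    """Return {table id: list of rows} for tables found inside HTML comments.

    Each row is a list with the text of its <th>/<td> cells.
    """
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    # Comments are special strings in the tree, so filter strings by type.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        # Re-parse the comment body as its own HTML fragment.
        inner = BeautifulSoup(comment, "html.parser")
        for table in inner.select('table[id^="%s"]' % id_prefix):
            rows = []
            for tr in table.find_all("tr"):
                cells = tr.find_all(["th", "td"])
                rows.append([cell.get_text(strip=True) for cell in cells])
            result[table["id"]] = rows
    return result


if __name__ == "__main__":
    # A plain search only sees the live table, not the commented one.
    live = BeautifulSoup(SAMPLE_HTML, "html.parser").find_all("table")
    print("live tables:", [t["id"] for t in live])
    for table_id, rows in commented_table_rows(SAMPLE_HTML, "box-score-advanced").items():
        print(table_id)
        for row in rows:
            print("  ", row)
```

On the real page, the `response.content` fetched above would take the place of `SAMPLE_HTML`, and header rows would still need to be skipped as in the original loop.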

Answer 1: (score: 2)

@snakecharmerb is on the right track: this table does not exist in the raw HTML and must be added at runtime by JavaScript.

Running this:

$ curl http://www.sports-reference.com/cbb/boxscores/2016-11-11-villanova.html | grep "box-score-advanced-lafayette"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9891    0  9891    0     0  45371      0 --:--:-- --:--:-- --:--:-- 48965<div id="all_box-score-advanced-lafayette" class="table_wrapper setup_commented commented">
  <span class="section_anchor" id="box-score-advanced-lafayette_link" data-label="Advanced Box Score"></span>
      <div class="overthrow table_container" id="div_box-score-advanced-lafayette">
  <table class="sortable stats_table" id="box-score-advanced-lafayette" data-cols-to-freeze=1><caption>&nbsp; Table</caption>
100  141k    0  141k    0     0   349k      0 --:--:-- --:--:-- --:--:--  363k

As the output shows, all that is present in the HTML is the container in which the table is built.
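
A grep match alone does not show whether a matched tag is live markup or sits inside an HTML comment (the wrapper's `commented` class hints that it may be the latter). One way to check is sketched below using only the standard library; the class name `TableLocator`, the function `locate_table`, and the `sample` string are hypothetical and only illustrate the idea.

```python
from html.parser import HTMLParser


class TableLocator(HTMLParser):
    """Record whether a table id appears as a real tag or inside a comment."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.in_live_markup = False
        self.in_comment = False

    def handle_starttag(self, tag, attrs):
        # Called only for tags the parser treats as real markup.
        if tag == "table" and dict(attrs).get("id") == self.table_id:
            self.in_live_markup = True

    def handle_comment(self, data):
        # Comment bodies arrive as plain text, so search them as a string.
        if 'id="%s"' % self.table_id in data:
            self.in_comment = True


def locate_table(html, table_id):
    """Return "live", "comment", or "absent" for the given table id."""
    parser = TableLocator(table_id)
    parser.feed(html)
    parser.close()
    if parser.in_live_markup:
        return "live"
    if parser.in_comment:
        return "comment"
    return "absent"


if __name__ == "__main__":
    # Hypothetical fragment shaped like the curl output above.
    sample = (
        '<div id="all_box-score-advanced-lafayette" class="commented">\n'
        '<!--\n'
        '<table class="sortable stats_table" id="box-score-advanced-lafayette">'
        '</table>\n'
        '-->\n'
        '</div>'
    )
    print(locate_table(sample, "box-score-advanced-lafayette"))
```

A result of `"comment"` would mean the table can be parsed straight from the downloaded HTML, while `"absent"` would point toward content that is generated at runtime.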

To grab something like this, I recommend an approach such as PhantomJS: http://phantomjs.org