BeautifulSoup is not finding all 'th' tags

Asked: 2019-08-29 21:15:45

Tags: python-3.x beautifulsoup html-parsing

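I am currently trying to scrape a statistics site with BeautifulSoup in Python 3.7. I want to take all of the headers from a table and use them as my column headers, but for some reason BeautifulSoup is not picking up every header inside the 'th' tags.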

Here is my code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.eliteprospects.com/team/552/guelph-storm/2005-2006?tab=stats'
html = urlopen(url)
scraper = BeautifulSoup(html, 'html.parser')
# Find the column headers: every <th> inside the first <tr> of the document.
column_headers = [th.getText() for th in scraper.findAll('tr', limit=1)[0].findAll('th')]
print(column_headers)

This is the output I get: ['#','Player','GP','G','A','TP']

This is the output I should get: ['#','Player','GP','G','A','TP','PIM','+/-','GP','G','A','TP','PIM','+/-']

For reference, the HTML source of the table looks like this:

<table class="table table-striped table-sortable skater-stats highlight-stats" data-sort-url="https://www.eliteprospects.com/team/552/guelph-storm/2005-2006?tab=stats" data-sort-ajax-container="#players" data-sort-ajax-url="https://www.eliteprospects.com/ajax/team.player-stats?teamId=552&amp;season=2005-2006&amp;position=">
                <thead style="background-color: #fff">
                    <tr style="background-color: #fff">
                        <th class="position">#</th>
                        <th class="player sorted" data-sort="player">Player<i class="fa fa-caret-down"></i></th>
                        <th class="gp" data-sort="gp">GP</th>
                        <th class="g" data-sort="g">G</th>
                        <th class="a" data-sort="a">A</th>
                        <th class="tp" data-sort="tp">TP</th>
                        <th class="pim" data-sort="pim">PIM</th>
                        <th class="pm" data-sort="pm">+/-</th>
                        <th class="separator">&nbsp;</th>
                        <th class="playoffs gp" data-sort="playoffs-gp">GP</th>
                        <th class="playoffs g" data-sort="playoffs-g">G</th>
                        <th class="playoffs a" data-sort="playoffs-a">A</th>
                        <th class="playoffs tp" data-sort="playoffs-tp">TP</th>
                        <th class="playoffs pim" data-sort="playoffs-pim">PIM</th>
                        <th class="playoffs pm" data-sort="playoffs-pm">+/-</th>
                    </tr>
                </thead>
                <tbody>

Any help would be greatly appreciated!

2 Answers:

Answer 0 (score: 2)

Looking at the source of the page you are scraping, this is what the first header row in it looks like:

    <div class="table-wizard">
        <table class="table table-striped">
            <thead>
                <tr>
                    <th class="position">#</th>
                    <th class="player">Player</th>
                    <th class="gp">GP</th>
                    <th class="g">G</th>
                    <th class="a">A</th>
                    <th class="sorted tp">TP</th>
                </tr>
            </thead>
            <tbody>

That is why this is the only data you get. JavaScript does not even modify it afterwards. If I run querySelector in the browser console, I get the same result:

> document.querySelector('tr')
> <tr>
      <th class="position">#</th>
      <th class="player">Player</th>
      <th class="gp">GP</th>
      <th class="g">G</th>
      <th class="a">A</th>
      <th class="sorted tp">TP</th>
  </tr>

In short, Beautiful Soup is giving you all of the th tags in the first tr tag.
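To make this concrete, here is a minimal sketch that runs against a hand-written HTML fragment, not the live page. The fragment is an assumption: a trimmed-down stand-in with a small summary table followed by the larger stats table, modeled on the two snippets shown in this thread. It shows that find('tr') lands in whichever table comes first, and that counting th cells per table reveals where the full header row lives.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the page: a small summary table
# followed by the full stats table (class names taken from the thread).
SAMPLE_HTML = """
<div class="table-wizard">
  <table class="table table-striped">
    <thead><tr><th>#</th><th>Player</th><th>GP</th></tr></thead>
  </table>
</div>
<table class="table table-striped skater-stats">
  <thead><tr><th>#</th><th>Player</th><th>GP</th><th>G</th><th>A</th></tr></thead>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first match in document order, so this is the
# header row of the first (smaller) table.
first_row = soup.find("tr")
first_row_headers = [th.get_text() for th in first_row.find_all("th")]

# Count header cells per table to see which table has the full header row.
header_counts = [
    (table.get("class"), len(table.find_all("th")))
    for table in soup.find_all("table")
]

print(first_row_headers)
print(header_counts)
```

Running the same per-table count on the real page should show which table to target, though the exact number and order of tables there may differ from this fragment.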

If you use the CSS selector tr:has(th) to grab the second tr tag that contains th tags, you will see more th tags:

column_headers = [th.getText() for th in scraper.select('tr:has(th)', limit=2)[1].findAll('th')]

Output:

['#', 'Player', 'GP', 'G', 'A', 'TP', 'PIM', '+/-', '\xa0', 'GP', 'G', 'A', 'TP', 'PIM', '+/-']   
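Indexing the second matching row works, but it depends on the order of the tables on the page. A possibly more robust variation is to select the header cells through the table's own class and skip the separator cell, so that the '\xa0' entry does not end up in the list. The sketch below runs against a copy of the header markup from the question (attributes trimmed), not the live page, and it assumes the skater-stats and separator class names stay as shown there.

```python
from bs4 import BeautifulSoup

# Header row copied from the question's HTML (attributes trimmed for brevity).
HEADER_HTML = """
<table class="table table-striped table-sortable skater-stats highlight-stats">
  <thead>
    <tr>
      <th class="position">#</th>
      <th class="player sorted">Player<i class="fa fa-caret-down"></i></th>
      <th class="gp">GP</th>
      <th class="g">G</th>
      <th class="a">A</th>
      <th class="tp">TP</th>
      <th class="pim">PIM</th>
      <th class="pm">+/-</th>
      <th class="separator">&nbsp;</th>
      <th class="playoffs gp">GP</th>
      <th class="playoffs g">G</th>
      <th class="playoffs a">A</th>
      <th class="playoffs tp">TP</th>
      <th class="playoffs pim">PIM</th>
      <th class="playoffs pm">+/-</th>
    </tr>
  </thead>
</table>
"""

soup = BeautifulSoup(HEADER_HTML, "html.parser")

# Target the stats table by class instead of relying on row position.
cells = soup.select("table.skater-stats thead th")

# Drop the spacer column; its only content is a non-breaking space.
headers = [
    th.get_text(strip=True)
    for th in cells
    if "separator" not in th.get("class", [])
]

print(headers)
```

With the live page, the same selector would be applied to the scraper object from the question instead of this fragment.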

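Answer 1 (score: 0)

Since the data is in a <table> tag, let pandas do the work for you (it can use bs4 under the hood). You can then manipulate the result however you need: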

import pandas as pd

url = 'https://www.eliteprospects.com/team/552/guelph-storm/2005-2006?tab=stats'
dfs = pd.read_html(url)

headers = list(dfs[1].columns)

print(headers)

Output:

print(headers)
['#', 'Player', 'GP', 'G', 'A', 'TP', 'PIM', '+/-', 'Unnamed: 8', 'GP.1', 'G.1', 'A.1', 'TP.1', 'PIM.1', '+/-.1']
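Note that pandas renames duplicate column names (GP.1, G.1, and so on) and labels the blank separator column 'Unnamed: 8'. If you want readable labels, a small cleanup step can help. The sketch below is plain Python applied to the list printed above. The clean_headers helper and the "Playoffs " prefix are illustrative assumptions, based on the playoffs class on those columns in the question's HTML.

```python
# Header list as printed by the pandas answer above.
raw_headers = ['#', 'Player', 'GP', 'G', 'A', 'TP', 'PIM', '+/-', 'Unnamed: 8',
               'GP.1', 'G.1', 'A.1', 'TP.1', 'PIM.1', '+/-.1']


def clean_headers(headers, playoff_prefix="Playoffs "):
    """Hypothetical helper: drop placeholder names and relabel duplicates."""
    cleaned = []
    for name in headers:
        if name.startswith("Unnamed:"):
            # Placeholder pandas assigns to the blank separator column.
            continue
        if name.endswith(".1"):
            # pandas appends ".1" to the second occurrence of a column name;
            # in this table those are the playoff columns.
            name = playoff_prefix + name[:-2]
        cleaned.append(name)
    return cleaned


cleaned = clean_headers(raw_headers)
print(cleaned)
```

This only cleans the list of names. To use them on the DataFrame itself, the separator column would need to be dropped first so that the number of columns matches.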