Python BeautifulSoup only returns the top row

Asked: 2016-06-16 01:12:15

Tags: python html beautifulsoup

I have been learning to scrape websites with BeautifulSoup, and it has been going well. Borrowing some code, I managed to scrape several sites. However, when I got to the site below, I only get back the top row of the table. I noticed in the HTML that the site's coder made the header row td tags rather than th tags, and I wonder whether that could be causing my problem. If so, is there a workaround? Am I missing something obvious? I have tried using different parsers.

    url = 'https://www.twinspires.com/php/brisstats/report.php?bris_id=4061015&report=activity'
    soup = BeautifulSoup(urllib2.urlopen(url).read(), 'html5lib')

    data = []
    table = soup.find('table', attrs={'id':'reporttable'})
    rows = table.findAll('tr')

    for row in rows:
        print row
        cols = row.find_all('td')  # cells in this row, not the whole table
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele]) # Get rid of empty values
    print data

IDLE output of the print row and print data statements:

<tr>
<td class="field-title" width="15%">Activity</td>
<td class="field-title" width="10%">Date</td>
<td class="field-title" width="10%">Track</td>
<td class="field-title" width="9%">Distance</td>
<td class="field-title" width="5%">Surf</td>
<td class="field-title" width="5%">Cond</td>
<td class="field-title" width="9%">Time</td>
<td class="field-title" width="10%">Class</td>
<td class="field-title" width="5%">Fin</td>
<td class="field-title">Comment</td>
</tr>
[[u'Activity', u'Date', u'Track', u'Distance', u'Surf', u'Cond', u'Time', u'Class', u'Fin', u'Comment']]

1 Answer:

Answer 0 (score: -1)

The actual data in the table is populated with JavaScript, which is why BeautifulSoup can't see it.

Luckily, the coder has hard-coded the username and password for the remote service that returns the data used to populate the table:

<script>
    var brisid ='4061015';
    $(document).ready(function (){
      var crossDomainUrl = 'https://www.twinspires.com/php/fw/php_BRIS_BatchAPI/2.3/Brisstats/activity?bris_id=4061015&username=username&password=password&output=json';
      $.ajax({ url: crossDomainUrl,
          dataType: 'jsonp',
          jsonp: 'jsonpcallback',
          jsonpCallback: 'dispdata'
        });
    });
</script>
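With output=json the service hands back plain JSON, but if the endpoint only answered in JSONP form (as the page's own $.ajax call requests), the callback wrapper would have to be stripped before parsing. A minimal sketch, assuming a wrapper named dispdata as in the script above; the payload string here is a stand-in, not real API output:

```python
import json
import re

# Stand-in JSONP response; the real service would wrap its JSON in the
# dispdata(...) callback named in the page's script.
jsonp_text = 'dispdata({"activity": {"bris_id": "4061015"}});'

# Strip the callback wrapper: capture everything between the first '('
# after the callback name and the final ')'.
match = re.search(r'^\s*\w+\((.*)\)\s*;?\s*$', jsonp_text, re.DOTALL)
payload = json.loads(match.group(1))
```

After this, payload is an ordinary dict, the same shape the plain-JSON request below yields directly.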

Using the excellent requests library, this is straightforward:

>>> import requests
>>> url = 'https://www.twinspires.com/php/fw/php_BRIS_BatchAPI/2.3/Brisstats/activity?bris_id=4061015&username=username&password=password&output=json'
>>> r = requests.get(url)
>>> data = r.json()
>>> data['activity']['activity-log-proc']['activity-logs']['activity-log'][0]
{u'comment': u'bmp brk,ins to 5/16pl', u'Distance': u'1m', u'Finish': u'4', u'laid_off': [], u'country': u'USA', u'time': [], u'surface': u'T', u'track_id': u'BEL - 09', u'track_condition': u'FM', u'race_number': u'9', u'race_type': u'Race-green', u'day_evening': u'D', u'horse_name': u'Antebellum', u'race_date': u'15Jun16', u'date': u'2016-06-15 00:00:00.0', u'class': u'MCL40000'}

You can then loop over data['activity']['activity-log-proc']['activity-logs']['activity-log'] to get all the results:

for i in data['activity']['activity-log-proc']['activity-logs']['activity-log']:
  print(i['track_condition'])  # etc.
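If the goal is the same list-of-rows structure the original scraping attempt built, each JSON record can be flattened in plain Python. A minimal sketch, using one sample record copied from the response shown above (the field order mirrors the HTML header row; the fields list is my choice, not part of the API):

```python
# Sample record copied from the API response shown above.
record = {
    'horse_name': 'Antebellum', 'race_date': '15Jun16',
    'track_id': 'BEL - 09', 'Distance': '1m', 'surface': 'T',
    'track_condition': 'FM', 'class': 'MCL40000', 'Finish': '4',
    'comment': 'bmp brk,ins to 5/16pl',
}

# Columns roughly matching the table header (Activity/Date/Track/...).
fields = ['horse_name', 'race_date', 'track_id', 'Distance',
          'surface', 'track_condition', 'class', 'Finish', 'comment']

def flatten(rec, fields):
    """Return one table row as a list of strings, '' for missing keys."""
    return [str(rec.get(f, '')) for f in fields]

row = flatten(record, fields)
```

Applying flatten to every record in the activity-log list would rebuild the data list the question's loop was trying to produce.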