I've been teaching myself to scrape websites with BeautifulSoup, and it has been going well. Borrowing code from examples, I've managed to scrape several sites. However, when I got to the site below, I only get back the top row of the table. I noticed in the HTML that the site's coder made the header row out of td tags instead of th tags, and I'm wondering if that is causing my problem. If so, is there a workaround? Am I missing something obvious? I have tried different parsers.
import urllib2
from bs4 import BeautifulSoup

url = 'https://www.twinspires.com/php/brisstats/report.php?bris_id=4061015&report=activity'
soup = BeautifulSoup(urllib2.urlopen(url).read(), 'html5lib')

data = []
table = soup.find('table', attrs={'id': 'reporttable'})
rows = table.find_all('tr')
for row in rows:
    print row
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])  # Get rid of empty values
print data
IDLE output from the print row and print data statements:
<tr>
<td class="field-title" width="15%">Activity</td>
<td class="field-title" width="10%">Date</td>
<td class="field-title" width="10%">Track</td>
<td class="field-title" width="9%">Distance</td>
<td class="field-title" width="5%">Surf</td>
<td class="field-title" width="5%">Cond</td>
<td class="field-title" width="9%">Time</td>
<td class="field-title" width="10%">Class</td>
<td class="field-title" width="5%">Fin</td>
<td class="field-title">Comment</td>
</tr>
[[u'Activity', u'Date', u'Track', u'Distance', u'Surf', u'Cond', u'Time', u'Class', u'Fin', u'Comment']]
Answer 0 (score: -1)
The actual data in the table is populated with JavaScript, which is why BeautifulSoup can't see it.
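You can confirm this from the static HTML alone: parse the page and count the rows that are actually served. A minimal check, using the same URL and table id as in the question:

import urllib2
from bs4 import BeautifulSoup

url = 'https://www.twinspires.com/php/brisstats/report.php?bris_id=4061015&report=activity'
soup = BeautifulSoup(urllib2.urlopen(url).read(), 'html5lib')
table = soup.find('table', attrs={'id': 'reporttable'})

# Only the header row is present in the raw HTML; the data rows
# are added client-side after the page loads.
print len(table.find_all('tr'))  # 1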
Fortunately, the coder has hard-coded the username and password for the remote service that returns the data used to populate the table:
<script>
var brisid = '4061015';
$(document).ready(function () {
    var crossDomainUrl = 'https://www.twinspires.com/php/fw/php_BRIS_BatchAPI/2.3/Brisstats/activity?bris_id=4061015&username=username&password=password&output=json';
    $.ajax({
        url: crossDomainUrl,
        dataType: 'jsonp',
        jsonp: 'jsonpcallback',
        jsonpCallback: 'dispdata'
    });
});
</script>
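The dataType: 'jsonp' and jsonpCallback: 'dispdata' options mean the browser receives the JSON wrapped in a dispdata(...) call. Fetching the URL directly with output=json returns bare JSON, but if you ever receive a wrapped response, a small helper can unwrap it first. A minimal sketch, assuming the wrapper is a single callback(...) call around the payload:

import json
import re
import requests

def strip_jsonp(text):
    # Peel off a "callback( ... )" wrapper if one is present;
    # otherwise return the text unchanged.
    match = re.match(r'^\s*[\w$]+\s*\((.*)\)\s*;?\s*$', text, re.DOTALL)
    return match.group(1) if match else text

url = ('https://www.twinspires.com/php/fw/php_BRIS_BatchAPI/2.3/Brisstats/'
       'activity?bris_id=4061015&username=username&password=password&output=json')
r = requests.get(url)
data = json.loads(strip_jsonp(r.text))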
Using the excellent requests library, this is straightforward:
>>> import requests
>>> url = 'https://www.twinspires.com/php/fw/php_BRIS_BatchAPI/2.3/Brisstats/activity?bris_id=4061015&username=username&password=password&output=json'
>>> r = requests.get(url)
>>> data = r.json()
>>> data['activity']['activity-log-proc']['activity-logs']['activity-log'][0]
{u'comment': u'bmp brk,ins to 5/16pl', u'Distance': u'1m', u'Finish': u'4', u'laid_off': [], u'country': u'USA', u'time': [], u'surface': u'T', u'track_id': u'BEL - 09', u'track_condition': u'FM', u'race_number': u'9', u'race_type': u'Race-green', u'day_evening': u'D', u'horse_name': u'Antebellum', u'race_date': u'15Jun16', u'date': u'2016-06-15 00:00:00.0', u'class': u'MCL40000'}
You can then loop over data['activity']['activity-log-proc']['activity-logs']['activity-log'] to get all of the results:
for i in data['activity']['activity-log-proc']['activity-logs']['activity-log']:
    print(i['track_condition'])  # etc.
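If you want the whole report in one file, the same records can be written straight to CSV. A minimal sketch, where the column list is an assumption based on the sample record above:

import csv
import requests

url = ('https://www.twinspires.com/php/fw/php_BRIS_BatchAPI/2.3/Brisstats/'
       'activity?bris_id=4061015&username=username&password=password&output=json')
logs = requests.get(url).json()['activity']['activity-log-proc']['activity-logs']['activity-log']

# Column names taken from the sample record above; 'time' and 'laid_off'
# are skipped because they appear as lists rather than scalar values.
fields = ['race_date', 'horse_name', 'track_id', 'Distance', 'surface',
          'track_condition', 'class', 'Finish', 'comment']

with open('activity.csv', 'wb') as f:  # binary mode for the csv module on Python 2
    writer = csv.DictWriter(f, fieldnames=fields, extrasaction='ignore')
    writer.writeheader()
    for log in logs:
        writer.writerow({k: log.get(k, '') for k in fields})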