How do I scrape a web page from a (JavaScript?) website?

Time: 2017-07-24 13:31:34

Tags: python html web-scraping beautifulsoup urllib2

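I am trying to scrape data from a website called flightradar24.

With my code I want to get the airport's name and scrape the "Arrivals" table. Scraping the name works, because it is just an h1 element in the HTML, but when I try to scrape the table with my code I don't get any values, only the object names (maybe because JavaScript is involved?).

Is there any way I can scrape this part of the page? (Python 2.7)

I tried this: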

import urllib2
from BeautifulSoup import BeautifulSoup

site = "https://www.flightradar24.com/data/airports/bud/arrivals"
hdr = {'User-Agent': 'Mozilla/5.0'}  # send a browser-like User-Agent header
req = urllib2.Request(site, headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

# The airport name is part of the static HTML, so this lookup works
name = soup.find('h1', attrs={'class': 'airport-name'})
print name

# The arrivals table comes back as an unfilled template (see output below)
table = soup.find('div', {"class": "row cnt-schedule-table"})
print table

When I print the table, I get this:

<div class="row cnt-schedule-table"><label class="m-b-m">ARRIVALS</label><table class="table table-condensed table-hover data-table m-n-t-15"><thead><tr class="hidden-xs hidden-sm"><th class="w-80">TIME</th><th class="w-90">FLIGHT</th><th>FROM</th><th>AIRLINE</th><th class="w-120">AIRCRAFT</th><th class="w-10"></th><th class="w-160">STATUS</th></tr><tr ng-cloak="ng-cloak" data-ng-class="{hidden: btnLoadEarlier === false}" ng-show="(isFetching == false &amp;&amp; airportView.schedule.arrivals.data.length &gt; 0)"> 0)"&gt;<td colspan="7" class="text-center"><button data-mode="arrivals" data-page="-1" data-timestamp="{{currentUtcTimestampRender / 1000}}" ng-click="loadMoreFlights($event)" data-current-page="{{airportView.schedule.arrivals.page.current}}" data-loading-text='&lt;i class="fa fa-circle-o-notch fa-spin"&gt;&lt;/i&gt; Loading earlier flights...' class="btn btn-table-action btn-flights-load">Load earlier flights</button></td></tr></thead><tbody><tr ng-cloak="ng-cloak" class="loader" ng-show="(isFetching == true)"><td colspan="7" class="text-center"><i class="fa fa-spinner fa-pulse"></i> Loading...</td></tr><tr ng-cloak="ng-cloak" ng-show="(isFetching == false &amp;&amp; airportView.schedule.arrivals.data.length == 0)"><td colspan="7" class="text-center">Sorry, we don't have any information about flights for this airport</td></tr><tr ng-cloak="ng-cloak" class="hidden-md hidden-lg" ng-repeat="objFlight in airportView.schedule.arrivals.data track by $index" ng-show="(isFetching == false)"><td colspan="7" class="state-block-{{objFlight.flight.status.generic.status.color || 'gray'}}"><div class="row"><div class="col-xs-12 col-sm-12 p-xxs"><span ng-bind-html="objFlight.flight.statusMessage.text | unsafe"></span> {{objFlight.flight.status.generic.eventTime.utc * 1000 || '' | date: timeFormat: timeZone}}</div></div><div class="row"><div class="col-xs-3 col-sm-3 p-xxs"><i class="fa fa-clock-o"></i> <span>{{objFlight.flight.time.scheduled.arrival * 1000 || '-' | date: 
timeFormat : timeZone}}</span></div><div class="col-xs-3 col-sm-3 p-xxs"><i class="fa fa-tag"></i> <a class="notranslate" ng-href="/data/flights/{{objFlight.flight.identification.number.default | lowercase}}">{{objFlight.flight.identification.number.default}}</a></div><div class="col-xs-6 col-sm-6 p-xxs"><i class="fa fa-map-marker"></i> <span ng-bind-html="objFlight.flight.airport.origin.position.region.city || '-' | unsafe">{{objFlight.flight.airport.origin.position.region.city}} </span><a class="notranslate" ng-href="/data/airports/{{objFlight.flight.airport.origin.code.iata | lowercase}}" title="{{objFlight.flight.airport.origin.name}}, {{objFlight.flight.airport.origin.position.country.name}}">({{objFlight.flight.airport.origin.code.iata}})</a></div></div><div class="row"><div class="col-xs-3 col-sm-3 p-xxs" title="{{objFlight.flight.aircraft.model.text || ''}}"><i class="fa fa-plane"></i> {{objFlight.flight.aircraft.model.code || '-'}}</div><div class="col-xs-3 col-sm-3 p-xxs"><a ng-show="(objFlight.flight.aircraft.registration)" class="notranslate" ng-href="/data/aircraft/{{objFlight.flight.aircraft.registration | lowercase}}">{{objFlight.flight.aircraft.registration}}</a></div><div class="col-xs-6 col-sm-6 p-xxs">{{ objFlight.flight.airline.name || '-'}}</div></div></td></tr><tr ng-cloak="ng-cloak" class="hidden-xs hidden-sm" ng-repeat="objFlight in airportView.schedule.arrivals.data track by $index" ng-show="(isFetching == false)" data-date="{{(objFlight.flight.time.scheduled.arrival * 1000) | date: 'EEEE, MMM dd' : timeZone}}" tbl-render-directive="tbl-render-directive"><td>{{objFlight.flight.time.scheduled.arrival * 1000 || '-' | date: timeFormat : timeZone}}</td><td class="p-l-s cell-flight-number"><a class="chevron-toggle" ng-if="(objFlight.flight.identification.codeshare != null)" data-codeshare="{{objFlight.flight.identification.codeshare}}"></a> <a class="notranslate" ng-href="/data/flights/{{objFlight.flight.identification.number.default | 
lowercase}}">{{objFlight.flight.identification.number.default}}</a></td><td><div ng-show="(objFlight.flight.airport.origin)"><span class="hide-mobile-only">{{objFlight.flight.airport.origin.position.region.city}} </span><a class="fs-10 fbold notranslate" ng-href="/data/airports/{{objFlight.flight.airport.origin.code.iata | lowercase}}" title="{{objFlight.flight.airport.origin.name}}, {{objFlight.flight.airport.origin.position.country.name}}">({{objFlight.flight.airport.origin.code.iata}})</a></div><div ng-show="!(objFlight.flight.airport.origin)">-</div></td><td ng-bind-html=" objFlight.flight.airline.name || '-' | unsafe" title="{{ objFlight.flight.airline.name || ''}}" class="cell-airline"></td><td><span class="notranslate" ng-show="(objFlight.flight.aircraft.model.code)">{{objFlight.flight.aircraft.model.code}} </span><a ng-show="(objFlight.flight.aircraft.registration)" class="fs-10 fbold notranslate" ng-href="/data/aircraft/{{objFlight.flight.aircraft.registration | lowercase}}">({{objFlight.flight.aircraft.registration}}) </a><span ng-if="(!objFlight.flight.aircraft.model.code &amp;&amp; !objFlight.flight.aircraft.registration)">-</span></td><td><div class="state-block {{objFlight.flight.status.generic.status.color || 'gray'}}"></div></td><td><span ng-bind-html="objFlight.flight.statusMessage.text | unsafe"></span> {{objFlight.flight.status.generic.eventTime.utc * 1000 || '' | date: timeFormat: timeZone}}</td></tr></tbody><tfoot><tr ng-cloak="ng-cloak" data-ng-class="{hidden: btnLoadLater === false }" ng-show="(isFetching == false &amp;&amp; airportView.schedule.arrivals.data.length &gt; 0 &amp;&amp; airportView.schedule.arrivals.page.current &lt; airportView.schedule.arrivals.page.total)"> 0 &amp;&amp; airportView.schedule.arrivals.page.current &lt; airportView.schedule.arrivals.page.total)"&gt;<td colspan="7" class="text-center"><button data-mode="arrivals" data-page="2" data-timestamp="{{currentUtcTimestampRender / 1000 | int}}" 
ng-click="loadMoreFlights($event)" data-current-page="{{airportView.schedule.arrivals.page.current}}" data-loading-text='&lt;i class="fa fa-circle-o-notch fa-spin"&gt;&lt;/i&gt; Loading later flights...' class="btn btn-table-action btn-flights-load">Load later flights</button></td></tr><tr ng-cloak="ng-cloak" ng-show="(isFetching == false)"><td colspan="7">* All times are in {{(airportView.schedule.arrivals.data &amp;&amp; timeZone.toUpperCase() == 'UTC' ? 'UTC' : 'local')}} timezone</td></tr></tfoot></table></div>
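The output above contains `ng-repeat` attributes and `{{...}}` markers, which look like AngularJS template syntax: the server seems to send only a template, and the browser fills in the rows with JavaScript afterwards. urllib2 does not run JavaScript, so the table stays empty. As a rough check, something like the following sketch could tell you whether a downloaded page still contains unfilled placeholders. The helper names `find_placeholders` and `looks_unrendered` are made up for this illustration, and the code runs on both Python 2.7 and Python 3.

```python
import re

# Matches AngularJS-style interpolation markers such as {{ some.expression }}
PLACEHOLDER = re.compile(r"\{\{\s*(.+?)\s*\}\}", re.DOTALL)


def find_placeholders(html):
    """Return the expressions found inside {{ ... }} markers in html."""
    return [expr.strip() for expr in PLACEHOLDER.findall(html)]


def looks_unrendered(html):
    """Heuristic: the page is a template if it still has markers or ng-repeat."""
    return bool(find_placeholders(html)) or "ng-repeat=" in html


# A fragment taken from the printed table above
sample = '<td>{{objFlight.flight.identification.number.default}}</td>'
print(find_placeholders(sample))
print(looks_unrendered(sample))
```

If `looks_unrendered` returns True for the page you downloaded, parsing the HTML harder will not help; the data has to come from wherever the page's JavaScript loads it.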

The code suggested in the answer does not work:

import urllib2
from bs4 import BeautifulSoup
import json

# new url      
url = 'https://www.flightradar24.com/data/airports/bud/arrivals'

# read all data
page = urllib2.urlopen(url).read()

# convert json text to python dictionary
data = json.loads(page)

print(data['row cnt-schedule-table'])

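1 Answer:

Answer 0 (score: 0)

Here is another Stack Overflow post with a solution to a very similar problem. It looks like you need to change the URL to the one the rendered page actually loads its data from, rather than the one you normally open in the browser.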
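To make that idea concrete, here is a minimal sketch, not a confirmed solution for this site. The real data URL is not given in this thread, so you would have to look it up yourself (for example in the Network tab of the browser's developer tools) and pass it to `fetch_json`. The key paths in `arrival_rows` are copied from the template placeholders in the question's output (for example `objFlight.flight.airline.name`); whether the actual JSON uses the same layout is an assumption. The helper names and the sample record are invented for illustration.

```python
import json

try:  # Python 3
    from urllib.request import Request, urlopen
except ImportError:  # Python 2.7, as used in the question
    from urllib2 import Request, urlopen


def fetch_json(url):
    """Download url with a browser-like User-Agent and decode the body as JSON."""
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    return json.loads(urlopen(req).read().decode('utf-8'))


def dig(obj, path, default='-'):
    """Follow a dotted path through nested dicts; return default on a miss."""
    for key in path.split('.'):
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return default if obj is None else obj


def arrival_rows(flights):
    """Flatten flight records using the paths seen in the page template."""
    rows = []
    for item in flights:
        rows.append({
            'flight': dig(item, 'flight.identification.number.default'),
            'from': dig(item, 'flight.airport.origin.code.iata'),
            'airline': dig(item, 'flight.airline.name'),
            'scheduled': dig(item, 'flight.time.scheduled.arrival'),
        })
    return rows


# Made-up record, only to show the assumed shape; it is not real flight data
sample = [{'flight': {'identification': {'number': {'default': 'XX123'}},
                      'airline': {'name': 'Example Air'}}}]
print(arrival_rows(sample))
```

Once you know the data URL and where the list of flights sits in the response, you would call `arrival_rows` on that list; fields missing from a record come back as '-', mirroring the `|| '-'` fallbacks in the template.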