I am trying to parse the rows of the table (the departure board times) from the following content:
buscms_widget_departureboard_ui_displayStop_Callback("
<div class='\"livetimes\"'>
<table class='\"busexpress-clientwidgets-departures-departureboard\"'>
<thead>
<tr class='\"rowStopName\"'>
<th colspan='\"3\"' data-bearing='\"SW\"' data-lat='\"51.7505683898926\"' data-lng='\"-1.225102186203\"' title='\"oxfajmwg\"'>
Divinity Road
</th>
<tr>
<tr class='\"textHeader\"'>
<th colspan='\"3\"'>
text 69325694 to 84637 for live times
</th>
<tr>
<tr class='\"rowHeaders\"'>
<th>
service
</th>
<th>
destination
</th>
<th>
time
</th>
<tr>
</tr>
</tr>
</tr>
</tr>
</tr>
</tr>
</thead>
<tbody>
<tr class='\"rowServiceDeparture\"'>
<td class='\"colServiceName\"'>
4A (OBC)
</td>
<td class='\"colDestination\"' rise\"="" title='\"Elms'>
Elms Rise
</td>
<td 21:49:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' mins\"="" title='\"5'>
5 mins
</td>
</tr>
<tr class='\"rowServiceDeparture\"'>
<td class='\"colServiceName\"'>
4A (OBC)
</td>
<td class='\"colDestination\"' rise\"="" title='\"Elms'>
Elms Rise
</td>
<td 22:11:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' mins\"="" title='\"27'>
27 mins
</td>
</tr>
<tr class='\"rowServiceDeparture\"'>
<td class='\"colServiceName\"'>
4 (OBC)
</td>
<td class='\"colDestination\"' title='\"Abingdon\"'>
Abingdon
</td>
<td 22:29:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' title='\"22:29\"'>
22:29
</td>
</tr>
<tr class='\"rowServiceDeparture\"'>
<td class='\"colServiceName\"'>
4A (OBC)
</td>
<td class='\"colDestination\"' rise\"="" title='\"Elms'>
Elms Rise
</td>
<td 22:49:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' mins\"="" title='\"65'>
65 mins
</td>
</tr>
<tr class='\"rowServiceDeparture\"'>
<td class='\"colServiceName\"'>
4A (OBC)
</td>
<td class='\"colDestination\"' rise\"="" title='\"Elms'>
Elms Rise
</td>
<td 23:09:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' title='\"23:09\"'>
23:09
</td>
</tr>
</tbody>
</table>
</div>
<div class='\"scrollmessage_container\"'>
<div class='\"scrollmessage\"'>
</div>
</div>
<div class='\"services\"'>
<a class='\"service' href='\"#\"' onclick="\"serviceNameClick('');\"" selected\"="">
all
</a>
<a class='\"service\"' href='\"#\"' onclick="\"serviceNameClick('4');\"">
4
</a>
</div>
<div class="dptime">
<span>
times generated at:
</span>
<span>
21:43
</span>
</div>
");
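In particular, I am trying to extract all the departure times, so I want the time remaining until each departure, e.g. "12 mins".
I have the following code: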
# import libraries
import urllib.request
from bs4 import BeautifulSoup
# specify the url
quote_page = 'http://www.buscms.com/api/REST/html/departureboard.aspx?callback=buscms_widget_departureboard_ui_displayStop_Callback&clientid=Nimbus&stopcode=69325694&format=jsonp&servicenamefilder=&cachebust=123&sourcetype=siri&requestor=Netescape&includeTimestamp=true&_=1487625719723'
# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(quote_page)
# parse the html using Beautiful Soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
print(soup.prettify())
I'm not sure how to find the minutes until departure in the output above. Is it something like this:
minutes_from_depart = soup.find("tbody", attrs={'td': 'mins'})
Answer 0 (score: 1)
Could you try this?
import urllib.request
from bs4 import BeautifulSoup
import re
quote_page = 'http://www.buscms.com/api/REST/html/departureboard.aspx?callback=buscms_widget_departureboard_ui_displayStop_Callback&clientid=Nimbus&stopcode=69325694&format=jsonp&servicenamefilder=&cachebust=123&sourcetype=siri&requestor=Netescape&includeTimestamp=true&_=1487625719723'
page = urllib.request.urlopen(quote_page).read()
soup = BeautifulSoup(page, 'lxml')
print(soup.prettify())
minutes = soup.find_all("td", class_=re.compile(r"colDepartureTime"))
for elements in minutes:
    print(elements.getText())
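The cell text printed by the loop above comes in two shapes in the question's sample: a relative value such as `5 mins` and a clock time such as `22:29`. If everything is wanted as minutes from now, a small helper along these lines might work. This is only a sketch: `minutes_until` is a hypothetical name, it assumes those two text formats are the only ones the widget emits, and it assumes a clock time earlier than the reference time means the following day.

```python
import re
from datetime import datetime, timedelta


def minutes_until(cell_text: str, now: datetime) -> int:
    """Convert departure cell text ('5 mins' or '22:29') to minutes from `now`."""
    text = cell_text.strip()

    # Relative form, e.g. "5 mins" or "1 min"
    relative = re.fullmatch(r"(\d+)\s*mins?", text)
    if relative:
        return int(relative.group(1))

    # Clock form, e.g. "22:29"
    clock = re.fullmatch(r"(\d{1,2}):(\d{2})", text)
    if clock is None:
        raise ValueError(f"unrecognised departure text: {text!r}")

    departure = now.replace(
        hour=int(clock.group(1)),
        minute=int(clock.group(2)),
        second=0,
        microsecond=0,
    )
    # Assumption: a time that has already passed today belongs to tomorrow.
    if departure < now:
        departure += timedelta(days=1)
    return int((departure - now).total_seconds() // 60)


# Reference time taken from the "times generated at" value in the question.
generated_at = datetime(2017, 2, 20, 21, 43)
for text in ["5 mins", "22:29", "23:09"]:
    print(text, "->", minutes_until(text, generated_at))
```

In real use `now` would more likely be `datetime.now()` or the parsed "times generated at" value from the page.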
Answer 1 (score: 1)
I got it working with the following code. It was actually quite simple once I used the soup.find_all function:
import urllib.request
from bs4 import BeautifulSoup
# specify the url
quote_page = 'http://www.buscms.com/api/REST/html/departureboard.aspx?callback=buscms_widget_departureboard_ui_displayStop_Callback&clientid=Nimbus&stopcode=69325694&format=jsonp&servicenamefilder=&cachebust=123&sourcetype=siri&requestor=Netescape&includeTimestamp=true&_=1487625719723'
# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(quote_page)
# parse the html using Beautiful Soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
for link in soup.find_all('td', class_='\\"colDepartureTime\\"'):
    print(link.get_text())
I get the following output:
10:40
10 mins
21 mins
30 mins
40 mins
50 mins
60 mins
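The backslashes in the class name are needed because the endpoint returns JSONP: the HTML sits inside a JavaScript string with every `"` escaped, so the parser treats `\"` as part of the attribute value. An alternative is to unwrap the callback first and parse clean HTML. The following is a minimal sketch using only the standard library. It assumes the response body is a single double-quoted string literal that is valid JSON apart from raw newlines; `unwrap_jsonp` and `DepartureTimeParser` are hypothetical names, and `sample` is a shortened stand-in modelled on the question's output, not a real response.

```python
import json
import re
from html.parser import HTMLParser


def unwrap_jsonp(payload: str) -> str:
    """Return the decoded string argument of a call like callback("...");"""
    match = re.match(r'[^(]*\(\s*(".*")\s*\)\s*;?\s*$', payload.strip(), re.DOTALL)
    if match is None:
        raise ValueError('payload does not look like callback("...")')
    # strict=False tolerates raw newlines inside the string literal.
    return json.loads(match.group(1), strict=False)


class DepartureTimeParser(HTMLParser):
    """Collect the text of every <td> whose class list contains colDepartureTime."""

    def __init__(self):
        super().__init__()
        self.times = []
        self._in_cell = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and "colDepartureTime" in classes:
            self._in_cell = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_cell:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self.times.append("".join(self._buffer).strip())
            self._in_cell = False


# Shortened, made-up payload in the same shape as the question's response.
sample = (
    'buscms_widget_departureboard_ui_displayStop_Callback("'
    '<table><tbody><tr>'
    '<td class=\\"colServiceName\\">4A (OBC)</td>'
    '<td class=\\"colDepartureTime\\" title=\\"5 mins\\">5 mins</td>'
    '</tr><tr>'
    '<td class=\\"colServiceName\\">4 (OBC)</td>'
    '<td class=\\"colDepartureTime\\" title=\\"22:29\\">22:29</td>'
    '</tr></tbody></table>");'
)

html = unwrap_jsonp(sample)
parser = DepartureTimeParser()
parser.feed(html)
print(parser.times)  # ['5 mins', '22:29']
```

The unwrapped `html` string could equally be handed to `BeautifulSoup(html, 'html.parser')`, in which case `find_all('td', class_='colDepartureTime')` should match without any escaping.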