I'm trying to scrape a div from the MTA info page. When I fetch the HTML and parse it with BeautifulSoup, some of the data seems to be missing.
Here is my code so far:
from bs4 import BeautifulSoup
import urllib # access the web
# SUBWAY STATUS PROJECT
userURL = "http://www.mta.info" # MTA SITE
htmlfile = urllib.urlopen(userURL) #creates html file
htmldoc = htmlfile.read() #creates html text
soup = BeautifulSoup(htmldoc, 'html.parser')
subChart = soup.find( id = 'subwayDiv')
print subChart
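I'm using print just to make sure I'm getting all of the data, and I can see that some of the information I want to grab is missing. If I inspect the page myself in the browser, I can see that what I'm missing is the div with the class that shows the subway status.
I'm very new to programming, so please excuse my ignorance.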
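Answer 0 (score: 0)
In the subChart variable, look for the elements with class subwayCategory and store the value of their id attribute. For example, from this part of the data: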
<div style="float: left; width: 220px; border-bottom: 1px solid #7B7B98; padding: 4px 0;">
<div class="span-11"><img alt="1 2 3 Subway" class="subwayIcon_123" src="http://www.mta.info/sites/all/modules/custom/servicestatus/images/img_trans.gif"/></div>
<div class="subwayCategory" id="123" style="margin-top: 4px;"></div>
The id of the div with class subwayCategory is 123.
Now make a request to http://www.mta.info/status/subway/{ID}, replacing {ID} with the id you want.
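The two steps above (collect the ids, then build one status URL per id) can be sketched as follows. This is only an illustration: it is written for Python 3, it runs on the two-line fragment quoted in this answer rather than on the live page, and it uses the standard library's HTMLParser so that it has no dependencies. The names CategoryIdCollector and status_urls are made up for the example. With BeautifulSoup, the equivalent lookup on the question's subChart would be subChart.find_all('div', class_='subwayCategory').

```python
from html.parser import HTMLParser

# Fragment copied from the excerpt quoted in this answer (inner divs only).
SAMPLE_HTML = """
<div class="span-11"><img alt="1 2 3 Subway" class="subwayIcon_123" src="http://www.mta.info/sites/all/modules/custom/servicestatus/images/img_trans.gif"/></div>
<div class="subwayCategory" id="123" style="margin-top: 4px;"></div>
"""

# URL pattern taken from the answer; {} is filled with the div's id.
STATUS_URL = "http://www.mta.info/status/subway/{}"


class CategoryIdCollector(HTMLParser):
    """Collect the id of every div whose class list contains 'subwayCategory'."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "subwayCategory" in classes and attrs.get("id"):
            self.ids.append(attrs["id"])


def status_urls(html):
    """Return one status URL per subwayCategory div found in html."""
    collector = CategoryIdCollector()
    collector.feed(html)
    return [STATUS_URL.format(line_id) for line_id in collector.ids]


urls = status_urls(SAMPLE_HTML)
print(urls)  # ['http://www.mta.info/status/subway/123']
```

Each URL in the resulting list would then be requested separately to get the status for that group of lines.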
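Answer 1 (score: 0)
The data is retrieved with an ajax request, so you can get the information in JSON format. You just need to pass a timestamp, which you can get with time.time(), and then parse the response with the json library: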
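from json import load, loads
from pprint import pprint as pp
from time import time
import urllib

# The current Unix timestamp is passed as the last path segment.
url = "http://www.mta.info/service_status_json/{}".format(int(time()))
# load() decodes the response body; its result is a string that is decoded again with loads().
json_dict = loads(load(urllib.urlopen(url)))
pp(json_dict)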
I haven't included all of the output because there is too much, but for "BT" we get:
{u'line': [{u'Date': {},
u'Time': {},
u'name': u'Bronx-Whitestone',
u'status': u'GOOD SERVICE',
u'text': {}},
{u'Date': {},
u'Time': {},
u'name': u'Cross Bay',
u'status': u'GOOD SERVICE',
u'text': {}},
{u'Date': {},
u'Time': {},
u'name': u'Henry Hudson',
u'status': u'GOOD SERVICE',
u'text': {}},
{u'Date': u'09/16/2016',
u'Time': u' 5:57AM',
u'name': u'Hugh L. Carey',
u'status': u'SERVICE CHANGE',
u'text': u" <span class='TitleServiceChange' >Service Change</span> <span class='DateStyle'> Posted: 09/16/2016 5:57AM </span><br/><br/> HLC - HOV Lane Open 6 AM to 10 AM. Two-Way Operations in effect. Three (3) lanes Manhattan-bound. One (1) lane Brooklyn-bound. <br/><br/> "},
{u'Date': {},
u'Time': {},
u'name': u'Marine Parkway',
u'status': u'GOOD SERVICE',
u'text': {}},
{u'Date': u'09/16/2016',
u'Time': u' 5:57AM',
u'name': u'Queens Midtown',
u'status': u'SERVICE CHANGE',
u'text': u" <span class='TitleServiceChange' >Service Change</span> <span class='DateStyle'> Posted: 09/16/2016 5:57AM </span><br/><br/> QMT - HOV Lane Open 6 AM to 10 AM. Two-Way Operation in effect. Three (3) lanes Manhattan bound. One (1) lane Queens bound. <br/><br/> <span class='TitlePlannedWork' >Planned Work</span> <br/> <P style='MARGIN: 0in 0in 0pt'><SPAN style=''Times New Roman';2016; Queens-Midtown Tunnel downtown exit; One lane closed. Use 37<SUP>th</SUP></FONT><FONT size=3> St tunnel exit for access to 2</FONT><SUP><FONT size=3>nd</FONT></SUP><FONT size=3> Ave. Motorists should allow extra time and may wish to use an alternate route if possible' Drivers should expect delays and plan accordingly. Motorists can sign up for MTA e-mail or text alerts at </FONT><SPAN style='COLOR: blue'><A href='http://www.mta.info/'><SPAN style='COLOR: #0563c1'><FONT size=3>www.mta.info</FONT></SPAN></A><FONT size=3> </FONT></SPAN><FONT size=3>and check the Bridges and Tunnels homepage or Facebook page for the latest information on this planned work.</FONT></FONT></SPAN></P> <br/><br/> <span class='TitlePlannedWork' >Planned Work</span> <br/> QMT- MANHATTAN PLAZA WORK REQUIRES CLOSURE OF 'CROSSTOWN' LANES FOR 2 MONTHS. CUSTOMERS SEEKING A CROSSTOWN MANHATTAN ROUTE USE THE UPTOWN LANES; EXPECT DELAYS. <br/><br/> "},
{u'Date': u'08/15/2016',
u'Time': u' 3:56PM',
u'name': u'Robert F. Kennedy',
u'status': u'PLANNED WORK',
u'text': u" <span class='TitlePlannedWork' >Planned Work</span> <br/> <P style='MARGIN: 0in 0in 0pt'><SPAN style='COLOR: #1f497d'><FONT size=3 face=Calibri>Starting Monday, August 15, 2016 and through early 2018, one lane will be closed on the Queens-to-Manhattan ramp at the Robert F. Kennedy Bridge for roadway rehabilitation. In addition, overnight on Thursday, August 18 and Friday, August 19, there will be a series of intermittent FULL ramp closures, lasting 15-20 minutes each.</FONT></SPAN></P> <br/><br/> "},
{u'Date': {},
u'Time': {},
u'name': u'Throgs Neck',
u'status': u'GOOD SERVICE',
u'text': {}},
{u'Date': u'09/16/2016',
u'Time': u' 5:28AM',
u'name': u'Verrazano-Narrows',
u'status': u'PLANNED WORK',
u'text': u" <span class='TitlePlannedWork' >Planned Work</span> <br/> VNB: PLANNED WORK; S. I. BOUND LOWER LEVEL - ONE LANE CLOSED; EXPECT DELAYS. <br/><br/> "}]}
So you just need to go through the dictionary and pick out what you want.
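As a rough sketch of what "going through the dictionary" could look like, the snippet below lists every entry that is not in good service. It is written for Python 3 and runs on a trimmed copy of the "BT" output above (the long text values are left out) instead of calling the endpoint. The variable bt and the function disrupted are names invented for this example.

```python
# Trimmed copy of the "BT" output shown above; the long 'text' values are omitted.
bt = {
    "line": [
        {"name": "Bronx-Whitestone", "status": "GOOD SERVICE", "Date": {}, "Time": {}},
        {"name": "Hugh L. Carey", "status": "SERVICE CHANGE",
         "Date": "09/16/2016", "Time": " 5:57AM"},
        {"name": "Robert F. Kennedy", "status": "PLANNED WORK",
         "Date": "08/15/2016", "Time": " 3:56PM"},
        {"name": "Throgs Neck", "status": "GOOD SERVICE", "Date": {}, "Time": {}},
    ]
}


def disrupted(section):
    """Return (name, status, posted) for every entry not in good service."""
    rows = []
    for line in section["line"]:
        if line["status"] == "GOOD SERVICE":
            continue
        # Date and Time are empty dicts when nothing is posted, so keep strings only.
        parts = [line[key].strip() for key in ("Date", "Time")
                 if isinstance(line[key], str)]
        rows.append((line["name"], line["status"], " ".join(parts)))
    return rows


for name, status, posted in disrupted(bt):
    print("{}: {} (posted {})".format(name, status, posted))
```

The isinstance check is there because, in the output above, Date and Time come back as empty dicts rather than empty strings when a line has nothing posted.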