Missing classes in BeautifulSoup

Time: 2016-09-16 04:19:35

Tags: python html parsing beautifulsoup

I'm trying to scrape a div from the MTA info page. When I fetch the HTML and parse it with BeautifulSoup, some of the data seems to be missing.

Here is my code so far:

from bs4 import BeautifulSoup
import urllib # access the web

# SUBWAY STATUS PROJECT
userURL = "http://www.mta.info" # MTA SITE

htmlfile = urllib.urlopen(userURL) #creates html file
htmldoc = htmlfile.read()   #creates html text

soup = BeautifulSoup(htmldoc, 'html.parser')    

subChart = soup.find( id = 'subwayDiv')

print subChart

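I'm using print only to check that I'm getting all of the data, and I can see that some of the information I want is missing. When I inspect the page in the browser, I can see that what's missing is the content of the div whose class shows the subway status.

I'm very new to programming, so please excuse my ignorance.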

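2 Answers:

Answer 0: (score: 0)

In the subChart variable, find the elements with the class subwayCategory and store the value of their id attribute. For example, from this part of the data: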

<div style="float: left; width: 220px; border-bottom: 1px solid #7B7B98; padding: 4px 0;">
<div class="span-11"><img alt="1 2 3 Subway" class="subwayIcon_123" src="http://www.mta.info/sites/all/modules/custom/servicestatus/images/img_trans.gif"/></div>
<div class="subwayCategory" id="123" style="margin-top: 4px;"></div>

The id of the div with the class subwayCategory is 123. Now make a request to http://www.mta.info/status/subway/{ID}, replacing {ID} with the id you want.
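
The steps in this answer could be sketched roughly as follows. This is only an illustration, not the answerer's code. It is written for Python 3 and uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages. SAMPLE_HTML is the fragment quoted above with a closing tag added, and CategoryIdCollector and status_urls are hypothetical names. With BeautifulSoup, soup.find_all('div', class_='subwayCategory') would select the same elements.

```python
from html.parser import HTMLParser

# Fragment quoted in the answer, with the outer div closed so it is well formed.
SAMPLE_HTML = """
<div style="float: left; width: 220px; border-bottom: 1px solid #7B7B98; padding: 4px 0;">
<div class="span-11"><img alt="1 2 3 Subway" class="subwayIcon_123" src="http://www.mta.info/sites/all/modules/custom/servicestatus/images/img_trans.gif"/></div>
<div class="subwayCategory" id="123" style="margin-top: 4px;"></div>
</div>
"""

# URL pattern taken from the answer; {} is filled with the id.
STATUS_URL = "http://www.mta.info/status/subway/{}"


class CategoryIdCollector(HTMLParser):
    """Collects the id of every div whose class list contains subwayCategory."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "div" and "subwayCategory" in classes and attr_map.get("id"):
            self.ids.append(attr_map["id"])


def status_urls(html):
    """Returns one status URL per subwayCategory div found in the markup."""
    collector = CategoryIdCollector()
    collector.feed(html)
    return [STATUS_URL.format(line_id) for line_id in collector.ids]


if __name__ == "__main__":
    for url in status_urls(SAMPLE_HTML):
        print(url)
```

Each returned URL would then be fetched separately. Whether that endpoint still responds in the same way is not something this sketch checks.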

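Answer 1: (score: 0)

The data is retrieved with an ajax request, and you can get the information in JSON format. You only need to pass a timestamp, which you can get with time.time(), and then parse the response with the json library: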

from time import time
from json import load, loads
import urllib

url  = "http://www.mta.info/service_status_json/{}".format(int(time()))

json_dict = loads(load(urllib.urlopen(url))) 

from pprint import pprint as pp
pp(json_dict)
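
The nested loads(load(...)) call suggests that the response body is a JSON string whose content is itself JSON, so it has to be decoded twice. That reading is an inference from the code, not something the answer states. Below is a small offline Python 3 sketch of the two pieces, under that assumption. build_status_url and decode_twice are hypothetical helper names, and fake_body is a stand-in for the real response.

```python
import json
from time import time


def build_status_url(now=None):
    """Builds the URL from the answer with an integer Unix timestamp appended."""
    timestamp = int(time() if now is None else now)
    return "http://www.mta.info/service_status_json/{}".format(timestamp)


def decode_twice(raw_body):
    """Decodes a body that is a JSON string wrapping another JSON document."""
    inner = json.loads(raw_body)  # first pass yields a str
    return json.loads(inner)      # second pass yields the dictionary


if __name__ == "__main__":
    # Stand-in body (assumption): a dict serialized twice to mimic the double encoding.
    fake_body = json.dumps(json.dumps({"BT": {"line": []}}))
    print(build_status_url())
    print(decode_twice(fake_body))
```

If the body were plain JSON instead, the second json.loads call would raise a TypeError, because it would be handed a dict, not a string.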

I haven't included all of the output because there is too much, but using "BT" we get:

{u'line': [{u'Date': {},
            u'Time': {},
            u'name': u'Bronx-Whitestone',
            u'status': u'GOOD SERVICE',
            u'text': {}},
           {u'Date': {},
            u'Time': {},
            u'name': u'Cross Bay',
            u'status': u'GOOD SERVICE',
            u'text': {}},
           {u'Date': {},
            u'Time': {},
            u'name': u'Henry Hudson',
            u'status': u'GOOD SERVICE',
            u'text': {}},
           {u'Date': u'09/16/2016',
            u'Time': u' 5:57AM',
            u'name': u'Hugh L. Carey',
            u'status': u'SERVICE CHANGE',
            u'text': u"                    <span class='TitleServiceChange' >Service Change</span>                    <span class='DateStyle'>                    &nbsp;Posted:&nbsp;09/16/2016&nbsp; 5:57AM                    </span><br/><br/>                  HLC - HOV Lane Open 6 AM to 10 AM. Two-Way Operations in effect. Three (3) lanes Manhattan-bound. One (1) lane Brooklyn-bound.                <br/><br/>              "},
           {u'Date': {},
            u'Time': {},
            u'name': u'Marine Parkway',
            u'status': u'GOOD SERVICE',
            u'text': {}},
           {u'Date': u'09/16/2016',
            u'Time': u' 5:57AM',
            u'name': u'Queens Midtown',
            u'status': u'SERVICE CHANGE',
            u'text': u"                    <span class='TitleServiceChange' >Service Change</span>                    <span class='DateStyle'>                    &nbsp;Posted:&nbsp;09/16/2016&nbsp; 5:57AM                    </span><br/><br/>                  QMT - HOV Lane Open 6 AM to 10 AM. Two-Way Operation in effect. Three (3) lanes Manhattan bound. One (1) lane Queens bound.                <br/><br/>                                  <span class='TitlePlannedWork' >Planned Work</span>                    <br/>                  <P  style='MARGIN: 0in 0in 0pt'><SPAN style=''Times New Roman';2016; Queens-Midtown Tunnel downtown exit; One lane closed. Use 37<SUP>th</SUP></FONT><FONT size=3> St tunnel exit for access to 2</FONT><SUP><FONT size=3>nd</FONT></SUP><FONT size=3> Ave. Motorists should allow extra time and may wish to use an alternate route if possible' Drivers should expect delays and plan accordingly. Motorists can sign up for MTA e-mail or text alerts at </FONT><SPAN style='COLOR: blue'><A href='http://www.mta.info/'><SPAN style='COLOR: #0563c1'><FONT size=3>www.mta.info</FONT></SPAN></A><FONT size=3> </FONT></SPAN><FONT size=3>and check the Bridges and Tunnels homepage or Facebook page for the latest information on this planned work.</FONT></FONT></SPAN></P>                <br/><br/>                                  <span class='TitlePlannedWork' >Planned Work</span>                    <br/>                  QMT- MANHATTAN PLAZA WORK REQUIRES CLOSURE OF 'CROSSTOWN' LANES FOR 2 MONTHS. CUSTOMERS SEEKING A CROSSTOWN MANHATTAN ROUTE USE THE UPTOWN LANES; EXPECT DELAYS.                <br/><br/>              "},
           {u'Date': u'08/15/2016',
            u'Time': u' 3:56PM',
            u'name': u'Robert F. Kennedy',
            u'status': u'PLANNED WORK',
            u'text': u"                    <span class='TitlePlannedWork' >Planned Work</span>                    <br/>                  <P  style='MARGIN: 0in 0in 0pt'><SPAN style='COLOR: #1f497d'><FONT size=3 face=Calibri>Starting Monday, August 15, 2016 and through early 2018, one lane will be closed on the Queens-to-Manhattan ramp at the Robert F. Kennedy Bridge for roadway rehabilitation. In addition, overnight on Thursday, August 18 and Friday, August 19, there will be a series of intermittent FULL ramp closures, lasting 15-20 minutes each.</FONT></SPAN></P>                <br/><br/>              "},
           {u'Date': {},
            u'Time': {},
            u'name': u'Throgs Neck',
            u'status': u'GOOD SERVICE',
            u'text': {}},
           {u'Date': u'09/16/2016',
            u'Time': u' 5:28AM',
            u'name': u'Verrazano-Narrows',
            u'status': u'PLANNED WORK',
            u'text': u"                    <span class='TitlePlannedWork' >Planned Work</span>                    <br/>                  VNB: PLANNED WORK; S. I. BOUND LOWER LEVEL - ONE LANE CLOSED; EXPECT DELAYS.                <br/><br/>              "}]}

So you just need to go through the dictionary and pick out what you want.
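
As a rough illustration of picking values out of the dictionary, the Python 3 sketch below lists every entry whose status is not GOOD SERVICE and reduces its HTML text to plain text. SAMPLE_BT is an abridged copy of the output above, and plain_text and disrupted_lines are hypothetical helper names. The regular expression is a crude tag stripper chosen to keep the example dependency-free; for messier markup, parsing the text with BeautifulSoup and calling get_text() would be the sturdier choice.

```python
import re
from html import unescape

# Abridged sample shaped like the "BT" output shown above.
SAMPLE_BT = {
    "line": [
        {"Date": {}, "Time": {}, "name": "Cross Bay",
         "status": "GOOD SERVICE", "text": {}},
        {"Date": "09/16/2016", "Time": " 5:28AM", "name": "Verrazano-Narrows",
         "status": "PLANNED WORK",
         "text": "   <span class='TitlePlannedWork' >Planned Work</span>   <br/>"
                 "   VNB: PLANNED WORK; S. I. BOUND LOWER LEVEL - ONE LANE CLOSED;"
                 " EXPECT DELAYS.   <br/><br/>   "},
    ]
}


def plain_text(fragment):
    """Strips tags and collapses whitespace; empty values ({}) become ''."""
    if not isinstance(fragment, str):
        return ""
    without_tags = re.sub(r"<[^>]+>", " ", fragment)
    return " ".join(unescape(without_tags).split())


def disrupted_lines(section):
    """Returns (name, status, text) for every entry not in good service."""
    return [
        (line["name"], line["status"], plain_text(line["text"]))
        for line in section["line"]
        if line["status"] != "GOOD SERVICE"
    ]


if __name__ == "__main__":
    for name, status, text in disrupted_lines(SAMPLE_BT):
        print("{}: {} ({})".format(name, status, text))
```

The isinstance check matters because entries without an alert carry an empty dictionary, not an empty string, in their Date, Time and text fields.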