Beautiful Soup ignores important content on a web page

Time: 2017-03-15 01:12:20

Tags: python html parsing css-selectors beautifulsoup

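I am using Beautiful Soup to parse this library hours page. Because of bad weather today, the page shows an alert message to all students. The HTML that contains the alert message looks like this: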

<div id="alert-container">
  <div class="alert alert-error">
    <p>The University will resume normal operations on Wednesday, March 15.&nbsp; All Library facilities will be open according to the Spring Break 
schedule. &nbsp;
    <a href="http://hours.cul.columbia.edu/">Library Hours »
    </a>
    </p>
  </div>
</div>
<!--
<div class="alert alert-error" style="margin-bottom:15px;text-align:center;">
 <a href="http://library.columbia.edu/news/alert.html">Normal operations are expected to resume Monday, January 25. &nbsp; More information &raquo</a>
</div>
-->

I want to parse this alert message, but it turns out that whether I use lxml or html5lib, I get the wrong parse result:

<div id="alert-container">
</div>
  <!--
<div class="alert alert-error" style="margin-bottom:15px;text-align:center;">
  <a href="http://library.columbia.edu/news/alert.html">Normal operations are expected to resume Monday, January 25. &nbsp; More information &raquo
  </a>
</div>
-->

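In other words, everything inside <div id="alert-container"></div> has been removed, which seems strange to me. I have parsed a number of websites before and this is the first time I have run into this problem. As far as I can tell, I am parsing the site the correct way: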

import urllib2
from bs4 import BeautifulSoup

url = "https://hours.library.columbia.edu"
page = urllib2.urlopen(url)
# 'lxml' and 'html5lib' both give the same result here
soup = BeautifulSoup(page, 'lxml')
print(soup.find("div", {"id": "alert-container"}))

The result of running the code above is:

<div id="alert-container"></div>

Is this a problem with the website itself, or is it caused by the parser?

Thanks in advance!

1 Answer:

Answer 0: (score: 1)

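This happens because the initial page does not contain any elements inside "alert-container" in the first place. Those elements are loaded afterwards through an Ajax request ("https://api.library.columbia.edu/query.json?qt=alerts"), which returns a string in JSON format. Neither urllib2 nor Beautiful Soup runs JavaScript, so the parser only ever sees the empty container.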
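A quick way to tell the two cases apart is to look for the missing text in the raw page source before it reaches any parser. The sketch below is written for Python 3, where urllib2.urlopen became urllib.request.urlopen. The helper names fetch_source and in_static_source are made up for illustration, and the two sample strings are shortened versions of the fragments in the question rather than real responses.

```python
from urllib.request import urlopen  # Python 3 name for urllib2.urlopen


def fetch_source(url):
    """Return the raw HTML the server sends, before any JavaScript runs."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def in_static_source(source, snippet):
    """Return True if the snippet is already present in the raw source."""
    return snippet in source


# Offline illustration with shortened fragments from the question
# (hypothetical stand-ins, not real server responses).
as_served = '<div id="alert-container">\n</div>'
as_rendered = (
    '<div id="alert-container"><div class="alert alert-error">'
    "<p>The University will resume normal operations</p></div></div>"
)
snippet = "resume normal operations"

print(in_static_source(as_served, snippet))    # False
print(in_static_source(as_rendered, snippet))  # True
```

If in_static_source(fetch_source(url), snippet) is False for text that is visible in the browser, the content is most likely injected by a script after the page loads, and changing the parser will not bring it back.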

Querying that endpoint directly should work:

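import urllib2
import json

# Endpoint the page queries via Ajax to load its alerts
url = "https://api.library.columbia.edu/query.json?qt=alerts"
alert = json.load(urllib2.urlopen(url))
print(alert)                       # the full decoded JSON response
print(alert["alerts"][0]["html"])  # the HTML of the first alert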
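Since the "html" field is itself an HTML fragment, it can be handed back to Beautiful Soup to pull out the plain text and links. The following is a minimal sketch for Python 3. The sample payload is hypothetical: it only mirrors the alerts[0]["html"] shape used above, and the real response may carry other fields. The function name extract_alerts is also made up for illustration.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical payload shaped like the response used above; the real
# API response may contain additional or different fields.
sample_response = json.dumps({
    "alerts": [
        {
            "html": (
                "<p>The University will resume normal operations on "
                "Wednesday, March 15. "
                '<a href="http://hours.cul.columbia.edu/">Library Hours</a>'
                "</p>"
            )
        }
    ]
})


def extract_alerts(payload):
    """Return a (text, links) tuple for each alert in the decoded JSON."""
    results = []
    for item in payload.get("alerts", []):
        # Parse the embedded HTML fragment with the built-in parser.
        fragment = BeautifulSoup(item.get("html", ""), "html.parser")
        text = fragment.get_text(" ", strip=True)
        links = [a["href"] for a in fragment.find_all("a", href=True)]
        results.append((text, links))
    return results


for text, links in extract_alerts(json.loads(sample_response)):
    print(text)
    print(links)
```

Using .get() with defaults means an empty or differently shaped response yields an empty list instead of raising a KeyError or IndexError, which the direct alert["alerts"][0]["html"] lookup would do when no alert is active.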