Navigating an HTML tree in Python

Date: 2014-03-24 10:43:30

Tags: python html beautifulsoup

<td id="aisd_calendar-2014-04-28-0" class="single-day future" colspan="1" rowspan="1" date="**2014-04-28**" >
  <div class="inner">
    <div class="item">
  <div class="view-item view-item-aisd_calendar">
  <div class="calendar monthview">
        <div class="calendar.4168.field_date.8.0 contents">
                      <a href="/event/2013/regular-board-meeting">**Regular Board Meeting**</a>                      <span class="date-display-single">7:00 pm</span>          </div>  
        <div class="cutoff">&nbsp;</div>
      </div> 
  </div>   
</div>  </div>
</td>

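I have the HTML above. I want to extract the value of the "date" attribute (2014-04-28) and the "a href" link (Regular Board Meeting) from this markup. How can I do this in Python? Can it be done with Beautiful Soup?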

1 answer:

Answer 0 (score: 2):

Here is one way to do it with BeautifulSoup:

from bs4 import BeautifulSoup


data = """
<html>
    <body>
        <td id="aisd_calendar-2014-04-28-0" class="single-day future" colspan="1" rowspan="1" date="**2014-04-28**" >
          <div class="inner">
            <div class="item">
          <div class="view-item view-item-aisd_calendar">
          <div class="calendar monthview">
                <div class="calendar.4168.field_date.8.0 contents">
                              <a href="/event/2013/regular-board-meeting">**Regular Board Meeting**</a>                      <span class="date-display-single">7:00 pm</span>          </div>
                <div class="cutoff">&nbsp;</div>
              </div>
          </div>
        </div>  </div>
        </td>
    </body>
</html>
"""
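# Name the parser explicitly; recent bs4 releases warn when it is omitted.
soup = BeautifulSoup(data, 'html.parser')

td = soup.body.td  # or soup.find('td', id='aisd_calendar-2014-04-28-0')
print(td['date'].strip('*'))  # strip the literal asterisks around the value

# 'contents' is one of the two classes on the div that wraps the link.
link = soup.find('div', {'class': 'contents'}).a
print(link['href'])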

This prints:

2014-04-28
/event/2013/regular-board-meeting
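
The question also mentions the link text (Regular Board Meeting), which the snippet above does not print. As a rough sketch of a dependency-free alternative, the same values can be collected with the standard library's `html.parser` instead of BeautifulSoup. The class name `CalendarCellParser`, its attribute names, and the trimmed-down `sample` markup are illustrative assumptions, not part of the original answer.

```python
from html.parser import HTMLParser


class CalendarCellParser(HTMLParser):
    """Hypothetical helper: records the td 'date' attribute and the first link."""

    def __init__(self):
        super().__init__()
        self.date = None
        self.href = None
        self.link_text = None
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td" and self.date is None and "date" in attrs:
            # Drop the literal asterisks that surround the value in the markup.
            self.date = attrs["date"].strip("*")
        elif tag == "a" and self.href is None:
            self.href = attrs.get("href")
            self.link_text = ""
            self._in_link = True

    def handle_data(self, data):
        # Only text between <a> and </a> is accumulated.
        if self._in_link:
            self.link_text += data

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link = False
            self.link_text = self.link_text.strip().strip("*")


# Shortened copy of the markup from the question (assumption: same attributes).
sample = (
    '<td id="aisd_calendar-2014-04-28-0" date="**2014-04-28**">'
    '<div class="contents">'
    '<a href="/event/2013/regular-board-meeting">**Regular Board Meeting**</a>'
    '<span class="date-display-single">7:00 pm</span>'
    '</div></td>'
)

parser = CalendarCellParser()
parser.feed(sample)
parser.close()

print(parser.date)
print(parser.href)
print(parser.link_text)
```

With BeautifulSoup itself, the link text is available from the tag's `get_text()` method; the standard-library version is mainly useful when installing a third-party package is not an option.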

Also, if you need to convert the date to a Python datetime, you can use strptime():

from datetime import datetime

...

datetime.strptime(td['date'].strip('*'), '%Y-%m-%d')
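
A self-contained sketch of that conversion, with the attribute value hard-coded so it runs without the parsing step. Combining the date with the "7:00 pm" text from the `date-display-single` span is an extra illustration not present in the answer, and the variable names are assumptions. Note that `%p` depends on the locale's AM/PM strings, so this assumes an English or default C locale.

```python
from datetime import datetime

raw_date = "**2014-04-28**"  # value of the td's date attribute in the question
raw_time = "7:00 pm"         # text of the date-display-single span

# Strip the literal asterisks, then parse the ISO-style date.
day = datetime.strptime(raw_date.strip("*"), "%Y-%m-%d")

# Optionally fold in the 12-hour clock time shown next to the link.
start = datetime.strptime(f"{day:%Y-%m-%d} {raw_time}", "%Y-%m-%d %I:%M %p")

print(day.date())         # 2014-04-28
print(start.isoformat())  # 2014-04-28T19:00:00
```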

Hope that helps.