Parsing HTML with BeautifulSoup in Python

Date: 2013-03-31 11:18:21

Tags: python html parsing beautifulsoup

I have a question about parsing HTML with BeautifulSoup. The website I am trying to parse is this one: http://www.auc.nl/news-events/events-and-lectures/events-and-lectures.html?page=1&pageSize=40

First, I need to write a function that gives me all the h3 tags and all the p tags. I did this as follows:

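    from bs4 import BeautifulSoup
    import urllib2  # Python 2; on Python 3 use urllib.request instead

    # A second positional argument to urlopen is POST data, not a file mode, so it is omitted
    url = "http://www.auc.nl/news-events/events-and-lectures/events-and-lectures.html"
    website = urllib2.urlopen(url)

    def parseUsingSoup2(content):
        # Build the soup from the fetched page before searching it
        soup = BeautifulSoup(content)
        list1 = soup.findAll('h3')
        list2 = soup.findAll('p')
        return list1 + list2

    parseUsingSoup2(website)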
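
As a minimal sketch of the function described above, the snippet below runs against a small inline HTML string instead of the live page, so the markup is an assumption made purely for illustration. It targets Python 3 and the built-in `html.parser`; the name `parse_headings_and_paragraphs` is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page's structure may differ.
SAMPLE_HTML = """
<html><body>
  <h3>First heading</h3>
  <p>First paragraph.</p>
  <h3>Second heading</h3>
  <p>Second paragraph.</p>
</body></html>
"""

def parse_headings_and_paragraphs(content):
    """Return all <h3> tags followed by all <p> tags found in `content`."""
    soup = BeautifulSoup(content, "html.parser")
    return soup.find_all("h3") + soup.find_all("p")

tags = parse_headings_and_paragraphs(SAMPLE_HTML)
texts = [tag.get_text() for tag in tags]
print(texts)
```

If the tags are wanted in document order instead (heading, paragraph, heading, paragraph), passing a list such as `soup.find_all(["h3", "p"])` should do that in a single call.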

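The next part of the assignment asks for a list of events as 4-tuples (there is only one event on the website): time slot, title, type, and description.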

I really don't know how to start. My first attempt was this:

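    def GeneratingListofEvents(content):
        event = {}
        # Renamed from `list` so the built-in is not shadowed
        fields = ['time', 'title', 'feature', 'description']
        for item in fields:
            pass  # stuck here: how do I fill in each field from the HTML?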

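However, I don't know whether this is heading in the right direction, and I have not yet managed to retrieve the time from the HTML document without typing it in by hand. Thanks in advance.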

1 Answer:

Answer 0: (score: 0)

Note that all the information you need is inside <div class="agendaright">:

    from bs4 import BeautifulSoup
    import urllib2  # Python 2; on Python 3 use urllib.request instead

    # A second positional argument to urlopen is POST data, so it is omitted
    html = urllib2.urlopen("http://www.auc.nl/news-events/events-and-lectures/events-and-lectures.html")
    soup = BeautifulSoup(html)

    # Named `agenda` and `time_slot` to avoid shadowing the built-in `all` and the `time` module
    agenda = soup.find('div', class_="agendaright")
    time_slot = agenda.find('span', class_="event-time").text
    # u'18:00 - 20:00'
    title = agenda.h3.text
    # u'Images Without Borders Violence, Visuality, and Landscape in Postwar Ambon, Indonesia'
    feature = agenda.find('span', class_="feature").text
    # u' | Lecture'
    description = agenda.find('p', class_="event-description").text
    # u'This lecture explores the thematization of the visual and expansion of\nits terrain exemplified by the gigantic hijacked billboards with Jesus\nfaces and the painted murals with Christian themes which arose during\nthe ...'

    # A list holding one 4-tuple per event, as the question asks
    events = [(time_slot, title, feature, description)]
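
The same lookups can be wrapped in a function that returns one 4-tuple per event. What follows is only a sketch under stated assumptions: the inline HTML merely imitates the class names used above (`agendaright`, `event-time`, `feature`, `event-description`) and is not copied from the real page; `extract_events` and `text_or_none` are hypothetical helper names; and the code targets Python 3 with the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names used in the answer.
SAMPLE_HTML = """
<div class="agendaright">
  <span class="event-time">18:00 - 20:00</span>
  <h3>Images Without Borders</h3>
  <span class="feature"> | Lecture</span>
  <p class="event-description">This lecture explores ...</p>
</div>
<div class="agendaright">
  <h3>Event without a time</h3>
  <span class="feature"> | Seminar</span>
</div>
"""

def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text().strip() if tag is not None else None

def extract_events(content):
    """Return a list of (time_slot, title, feature, description) tuples."""
    soup = BeautifulSoup(content, "html.parser")
    events = []
    for block in soup.find_all("div", class_="agendaright"):
        time_slot = text_or_none(block.find("span", class_="event-time"))
        title = text_or_none(block.h3)
        feature = text_or_none(block.find("span", class_="feature"))
        if feature is not None:
            # Drop the leading " | " separator seen in the feature text
            feature = feature.lstrip("| ").strip()
        description = text_or_none(block.find("p", class_="event-description"))
        events.append((time_slot, title, feature, description))
    return events

events = extract_events(SAMPLE_HTML)
for event in events:
    print(event)
```

Returning None for a missing field, instead of calling `.text` on the result of `find` directly, keeps the loop from raising AttributeError when an event lacks one of the elements.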