Finding specific <p> tags after an <h1> tag with a Python HTML parser

Date: 2013-11-15 18:29:08

Tags: python parsing html-parsing beautifulsoup urllib2

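I'm trying to parse a series of web pages and grab only the 3 paragraphs that follow the heading on each page. They all have the same format (I think). I'm using urllib2 and Beautiful Soup, but I'm not sure how to jump to the heading and then grab the few <p> tags that come after it. I know the first split on "<h1>" is not correct, but so far it's my only decent attempt. Here is my code: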

from __future__ import print_function  # so that print() emits a blank line on Python 2

from bs4 import BeautifulSoup
import urllib2
from HTMLParser import HTMLParser

BANNED = ["/events/new"]

def main():
    soup = BeautifulSoup(urllib2.urlopen('http://b-line.binghamton.edu').read())

    for link in soup.find_all('a'):
        link = link.get('href')
        # Follow only event links that are not on the banned list.
        if link is not None and link not in BANNED and "/events/" in link:
            print()
            print(link)
            eventPage = "http://b-line.binghamton.edu" + link
            bLineSubPage = urllib2.urlopen(eventPage)
            bLineSubPageStr = bLineSubPage.read()
            headAccum = 0
            # Only the first chunk is processed, i.e. the text BEFORE the first <h1>.
            for data in bLineSubPageStr.split("<h1>"):
                if headAccum < 1:
                    accum = 0
                    # Print the text of at most 5 pieces split on "<p>".
                    for subData in data.split("<p>"):
                        if accum < 5:
                            try:
                                print(BeautifulSoup(subData).get_text())
                            except Exception as e:
                                print(e)
                            accum += 1
                    print()
                headAccum += 1
            bLineSubPage.close()
            print()

main()

1 Answer:

Answer 0 (score: 0)

>>> import bs4, urllib2
>>> page_txt = urllib2.urlopen("http://b-line.binghamton.edu/events/9305").read()
>>> soup = bs4.BeautifulSoup(page_txt.split("<h1>", 1)[-1])
>>> print soup.find_all("p")[:3]

Is this what you want?
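
If splitting the raw HTML string feels fragile, another option is to let Beautiful Soup locate the heading and then walk forward from it with `find_all_next`. What follows is only a minimal sketch, not something checked against the b-line pages: the function name `paragraphs_after_heading`, its `count` parameter, and the `SAMPLE_HTML` snippet are made up for illustration, and it assumes the first `<h1>` on the page is the heading you care about. It takes an HTML string, so you would pass it whatever `read()` returned.

```python
from bs4 import BeautifulSoup


def paragraphs_after_heading(html, count=3):
    """Return the text of the first `count` <p> tags that follow the first <h1>."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1")
    if heading is None:
        # No heading on this page, so there is nothing to anchor on.
        return []
    # find_all_next walks forward in document order from the heading,
    # so paragraphs that appear before the <h1> are never considered.
    following = heading.find_all_next("p", limit=count)
    return [p.get_text(strip=True) for p in following]


# Hypothetical page used only to illustrate the behaviour.
SAMPLE_HTML = """
<html><body>
<p>Navigation text before the heading</p>
<h1>Event title</h1>
<p>First paragraph</p>
<div><p>Second paragraph</p></div>
<p>Third paragraph</p>
<p>Fourth paragraph</p>
</body></html>
"""

print(paragraphs_after_heading(SAMPLE_HTML))
```

Because the search starts at the heading element rather than at a substring, it should keep working if the tag is written as `<h1 class="...">`, which a plain split on "<h1>" would miss.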