Question

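I am trying to parse a series of web pages and grab only the 3 paragraphs that appear after the heading on each page. They all have the same format (I think). I am using urllib2 and Beautiful Soup, but I'm not sure how to jump to the heading and then grab the few <p> tags that follow it. I know the first split("<h1>") is not the right way to do it, but it is my only decent attempt so far. Here is my code: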

from bs4 import BeautifulSoup
import urllib2
from HTMLParser import HTMLParser

BANNED = ["/events/new"]

def main():

    soup = BeautifulSoup(urllib2.urlopen('http://b-line.binghamton.edu').read())

    for link in soup.find_all('a'):
        link = link.get('href')
        if link is not None and link not in BANNED and "/events/" in link:
            print()
            print(link)          
            eventPage = "http://b-line.binghamton.edu" + link
            bLineSubPage = urllib2.urlopen(eventPage)   
            bLineSubPageStr = bLineSubPage.read()
            headAccum = 0  
            for data in bLineSubPageStr.split("<h1>"):
                if(headAccum < 1):
                    accum = 0 
                    for subData in data.split("<p>"):
                        if(accum < 5):
                            try:
                                print(BeautifulSoup(subData).get_text())
                            except Exception as e:
                                print(e) 
                            accum+=1
                    print()
                headAccum += 1           
            bLineSubPage.close()         
            print()

main()

Answer 1

>>> import urllib2, bs4
>>> page_txt = urllib2.urlopen("http://b-line.binghamton.edu/events/9305").read()
>>> soup = bs4.BeautifulSoup(page_txt.split("<h1>", 1)[-1])
>>> print soup.find_all("p")[:3]

Is this what you're after?
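
Splitting the raw string on "<h1>" only works when the tag is written exactly that way (no attributes, lowercase). As a rough alternative, here is a sketch that walks the document with the standard library's parser and keeps the first few <p> elements seen after the first <h1>. It assumes Python 3 (`html.parser`; the Python 2 module is `HTMLParser`, as imported in the question). The names `ParagraphsAfterHeading` and `paragraphs_after_h1` and the sample HTML are made up for illustration and are not taken from the real pages.

```python
from html.parser import HTMLParser


class ParagraphsAfterHeading(HTMLParser):
    """Collect the text of the first `limit` <p> elements after the first <h1>."""

    def __init__(self, limit=3):
        super().__init__()
        self.limit = limit
        self.seen_h1 = False   # becomes True once the first <h1> opens
        self.in_p = False      # True while inside a <p> we are collecting
        self.paragraphs = []   # finished paragraph texts
        self._chunks = []      # text pieces of the current paragraph

    def _finish_paragraph(self):
        # Store the paragraph being collected, if there is one.
        if self.in_p:
            self.paragraphs.append("".join(self._chunks).strip())
            self._chunks = []
            self.in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.seen_h1 = True
        elif tag == "p" and self.seen_h1:
            # An unclosed <p> is implicitly ended by the next <p>.
            self._finish_paragraph()
            if len(self.paragraphs) < self.limit:
                self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._finish_paragraph()

    def handle_data(self, data):
        if self.in_p:
            self._chunks.append(data)


def paragraphs_after_h1(html, limit=3):
    parser = ParagraphsAfterHeading(limit)
    parser.feed(html)
    parser.close()
    parser._finish_paragraph()  # flush a trailing unclosed <p>
    return parser.paragraphs


# Hypothetical page layout, used only to exercise the parser.
sample = """
<p>Site banner</p>
<h1>Event title</h1>
<p>First <b>bold</b> paragraph.</p>
<div><p>Second paragraph.</p></div>
<p>Third paragraph.</p>
<p>Fourth paragraph.</p>
"""
print(paragraphs_after_h1(sample))
# ['First bold paragraph.', 'Second paragraph.', 'Third paragraph.']
```

If you would rather stay with Beautiful Soup, `soup.find("h1").find_all_next("p", limit=3)` should select the same elements without any string splitting, provided the page actually contains an <h1>.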

Finding specific <p> tags after an <h1> tag with a Python HTML parser
