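I'm trying to parse a series of web pages and grab only the 3 paragraphs that appear after the heading on each page. They all have the same format (I think). I'm using urllib2 and Beautiful Soup, but I'm not sure how to jump to the heading and then grab the few <p> tags that follow it. I know the first split("<h1>") isn't the right approach, but it's my only decent attempt so far. Here is my code: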
from bs4 import BeautifulSoup
import urllib2
from HTMLParser import HTMLParser

BANNED = ["/events/new"]

def main():
    soup = BeautifulSoup(urllib2.urlopen('http://b-line.binghamton.edu').read())
    for link in soup.find_all('a'):
        link = link.get('href')
        if link != None and link not in BANNED and "/events/" in link:
            print()
            print(link)
            eventPage = "http://b-line.binghamton.edu" + link
            bLineSubPage = urllib2.urlopen(eventPage)
            bLineSubPageStr = bLineSubPage.read()
            headAccum = 0
            for data in bLineSubPageStr.split("<h1>"):
                if(headAccum < 1):
                    accum = 0
                    for subData in data.split("<p>"):
                        if(accum < 5):
                            try:
                                print(BeautifulSoup(subData).get_text())
                            except Exception as e:
                                print(e)
                            accum+=1
                    print()
                headAccum += 1
            bLineSubPage.close()
            print()

main()
Answer 0 (score: 0)
>>> import urllib2, bs4
>>> page_txt = urllib2.urlopen("http://b-line.binghamton.edu/events/9305").read()
>>> soup = bs4.BeautifulSoup(page_txt.split("<h1>", 1)[-1])
>>> print soup.find_all("p")[:3]
Is that what you want?
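Splitting the raw string on "<h1>" stops working as soon as the heading tag carries attributes (for example <h1 class="title">). A possible alternative is to let Beautiful Soup locate the heading and then walk forward from it with find_all_next. The sketch below is only an illustration: the sample HTML and the helper name paragraphs_after_heading are made up, and it assumes each event page has a single <h1> followed by the <p> tags of interest.

```python
from bs4 import BeautifulSoup

# Made-up sample standing in for an event page (assumption: the real pages
# have one <h1> heading followed by the <p> paragraphs we want).
SAMPLE_HTML = """
<html><body>
<p>Navigation text before the heading</p>
<h1 class="title">Spring Concert</h1>
<p>First paragraph.</p>
<div><p>Second paragraph.</p></div>
<p>Third paragraph.</p>
<p>Fourth paragraph.</p>
</body></html>
"""


def paragraphs_after_heading(html, count=3):
    """Return the text of the first `count` <p> tags after the first <h1>."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1")
    if heading is None:
        return []  # no heading on this page, so nothing to collect
    # find_all_next searches everything after the heading in document order,
    # so paragraphs nested inside other tags are found as well.
    paragraphs = heading.find_all_next("p", limit=count)
    return [p.get_text(strip=True) for p in paragraphs]


if __name__ == "__main__":
    for text in paragraphs_after_heading(SAMPLE_HTML):
        print(text)
```

To try this on a real page, pass the string returned by urllib2.urlopen(eventPage).read() as the html argument instead of SAMPLE_HTML.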