Website scraping with Python

Date: 2012-05-20 01:52:03

Tags: python url web-scraping beautifulsoup

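I am trying to scrape a website and have run into a few problems. First, Python says my code is broken, and I can't seem to find the problem. I'm also not sure what to put in the regular expression when searching. For example, what should go in: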

patFinderName = re.compile('<a rel.*href >(.*)</a>')

if the tag looks like this:

"<A HREF="javascript:popUp('https://events...edu47&idy=2012','580','580')">

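A minimal sketch of one possible pattern for a tag like the one above, written for Python 3 with only the standard `re` module (the post itself uses Python 2). The posted pattern requires a lowercase `<a rel` followed by the literal text `href >`, so it cannot match an uppercase `<A HREF="javascript:popUp(...)">` tag. The sample markup, its `example.edu` URL, and the name `extract_popup_urls` are hypothetical stand-ins, not taken from the real page.

```python
import re

# Case-insensitive match for <a href="javascript:popUp('URL', ...)">.
# The first quoted argument of popUp() is captured as the URL.
POPUP_LINK = re.compile(
    r"""<a\s+href\s*=\s*"javascript:popUp\('([^']+)'""",
    re.IGNORECASE,
)


def extract_popup_urls(html):
    """Return the first popUp() argument of every matching anchor tag."""
    return POPUP_LINK.findall(html)


# Hypothetical sample markup; the real page's URL is not reproduced here.
sample = """<A HREF="javascript:popUp('https://example.edu/event?id=47','580','580')">Concert</A>"""

print(extract_popup_urls(sample))
```

Using `[^']+` instead of `.*` keeps the capture from running past the closing quote when several links sit on one line.
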
Finally, how do I separate the data when different pieces of information share similar tags?
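
One way to keep similarly tagged values apart is to group them by the table row they appear in and label them by their `class` attribute or, failing that, by their position in the row. The sketch below assumes Python 3 and uses the standard `html.parser` module rather than the BeautifulSoup import from the post. The class names `date` and `place`, the sample rows, and the `RowSpanCollector` name are all hypothetical, since the real page's markup is not shown.

```python
from html.parser import HTMLParser


class RowSpanCollector(HTMLParser):
    """Collect the text of each <span> per table row, keyed by its class."""

    def __init__(self):
        super().__init__()
        self.rows = []            # one dict per <tr>
        self._current_key = None  # key of the <span> being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append({})
        elif tag == "span" and self.rows:
            # Fall back to a positional key when the span has no class.
            default_key = "span%d" % len(self.rows[-1])
            self._current_key = dict(attrs).get("class") or default_key
            self.rows[-1][self._current_key] = ""

    def handle_endtag(self, tag):
        if tag == "span":
            self._current_key = None

    def handle_data(self, data):
        # Text outside a <span> is ignored.
        if self._current_key is not None:
            self.rows[-1][self._current_key] += data.strip()


# Hypothetical markup: both cells use <span>, told apart by class or order.
sample = """
<table>
<tr><td><span class="date">May 20</span></td><td><span class="place">Main Hall</span></td></tr>
<tr><td><span class="date">May 21</span></td><td><span>Room 100</span></td></tr>
</table>
"""

parser = RowSpanCollector()
parser.feed(sample)
for row in parser.rows:
    print(row)
```

Each row becomes its own dictionary, so a date and a location that share the `<span>` tag no longer end up in one flat list.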

Here is my code:

from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

webpage = urlopen('https://events.bc.edu/cgi-bin/publish/webevent.cgi?cmd=listweek&y=2012&    m=05&d=20&de=1&tf=0&sib=1&sb=0&sa=0&ws=0&stz=Default&sort=e,m,t&cat=&swe=1&cf=list&set=1&cal=cal2').read

patFinderName = re.compile('<a rel.*href >(.*)</a>')
patFinderStartDate = re.compile('<span>(.*)</span>')
patFinderLocation = re.compile('<td><tr><tr><td>(.*)</tr></td><td><tr>')
patFinderInfo = re.compile('<tr><td><span>(.*)</span></TD></tr></table>')
patFinderLink = re.compile('<a rel.*href"(.*)">')

findPatName = re.findall(patFinderName,webpage)
findPatStartDate = re.findall(patFinderStartDate,webpage)
findPatLocation = re.findall(patFinderLocation,webpage)
findPatInfo = re.findall(patFinderInfo,webpage)
findPatLink = re.findall(patFinderLink,webpage)

listIterator = []
listIterator[:] = range(1,50)

for i in listIterator:
    print findPatName[i]
    print findPatStartDate[i]
    print findPatLocation[i]
    print findPatInfo[i]
    print findPatLink[i]

articlePage = url(findPatLink[i]).read()
divBegin =articlePage.find('<div class="listeventName">')

article = articlePage[divBegin:(divBegin+1000)]

soup = BeautifulSoup(article)

Namelist = soup.findAll('a')
StartDatelist = soup.findAll('span')
Locationlist = soup.findAll('td')
infolist = soup.findAll('span')
linklist = soup.findAll('a')

for i in NameList:
    print i
    print "\n"

print "<!------------End Names-------------->"

for i in startDateList:
    print i
    print "\n"

print "<!------------End StartDate-------------->"

for i in locationList:
    print i
    print "\n"

print "<!------------End Locations-------------->"

for i in infoList:
    print i
    print "\n"

print "<!------------End Info-------------->"

for i in linkList:
    print i
    print "\n"

print "<!------------End Document-------------->"
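
A few things in the code above will fail regardless of the regular expressions. `.read` is missing its parentheses, so `webpage` is a method object rather than the page text. `url(...)` is never defined (only `urlopen` is imported). The lists are created as `Namelist`, `StartDatelist`, and so on but looped over as `NameList`, `startDateList`, with different capitalization. And `range(1,50)` skips index 0 and raises `IndexError` once a result list has fewer than 50 entries. The sketch below shows one way the fetch-and-extract step might look in Python 3 (the post uses Python 2). The patterns, the sample markup, and the names `fetch` and `extract` are assumptions for illustration, not the real page's structure.

```python
import re
from urllib.request import urlopen

# Non-greedy, case-insensitive patterns; assumed shapes, not the real page's.
LINK = re.compile(r'<a\s[^>]*href\s*=\s*"([^"]*)"[^>]*>(.*?)</a>', re.I | re.S)
SPAN = re.compile(r'<span[^>]*>(.*?)</span>', re.I | re.S)


def fetch(url):
    """Download a page as text. read() must be called, not just referenced."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract(html):
    """Return (links, spans): links as (href, text) pairs, spans as text."""
    links = [(href, text.strip()) for href, text in LINK.findall(html)]
    spans = [text.strip() for text in SPAN.findall(html)]
    return links, spans


# Hypothetical markup standing in for the downloaded page.
sample = '<tr><td><a href="/event/1">Concert</a></td><td><span>May 20</span></td></tr>'

links, spans = extract(sample)
# zip() stops at the shorter list, so a missing match cannot raise IndexError.
for (href, name), date in zip(links, spans):
    print(name, date, href)
```

In real use, `extract(fetch(page_url))` would replace the sample string, with `page_url` being the events listing address.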

0 answers:

No answers