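I'm trying to scrape a website and have run into a few problems. First, I get an error saying my code is broken, and I can't find the cause. I'm also not sure what to put in the regular expressions I search with. For example: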
patFinderName = re.compile('<a rel.*href >(.*)</a>')
when the tag looks like this:
"<A HREF="javascript:popUp('https://events...edu47&idy=2012','580','580')">
Finally, how do I separate the data when different pieces of information share similar tags?
Here is my code:
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re
# read() has to be called so that webpage holds the page text rather than a bound method.
webpage = urlopen('https://events.bc.edu/cgi-bin/publish/webevent.cgi?cmd=listweek&y=2012&m=05&d=20&de=1&tf=0&sib=1&sb=0&sa=0&ws=0&stz=Default&sort=e,m,t&cat=&swe=1&cf=list&set=1&cal=cal2').read()
patFinderName = re.compile('<a rel.*href >(.*)</a>')
patFinderStartDate = re.compile('<span>(.*)</span>')
patFinderLocation = re.compile('<td><tr><tr><td>(.*)</tr></td><td><tr>')
patFinderInfo = re.compile('<tr><td><span>(.*)</span></TD></tr></table>')
patFinderLink = re.compile('<a rel.*href="(.*)">')
findPatName = re.findall(patFinderName,webpage)
findPatStartDate = re.findall(patFinderStartDate,webpage)
findPatLocation = re.findall(patFinderLocation,webpage)
findPatInfo = re.findall(patFinderInfo,webpage)
findPatLink = re.findall(patFinderLink,webpage)
# Indices 1 through 49; this raises IndexError if any list holds fewer than 50 matches.
for i in range(1, 50):
    print findPatName[i]
    print findPatStartDate[i]
    print findPatLocation[i]
    print findPatInfo[i]
    print findPatLink[i]
    # Fetch the linked page and keep 1000 characters starting at the event-name div.
    articlePage = urlopen(findPatLink[i]).read()
    divBegin = articlePage.find('<div class="listeventName">')
    article = articlePage[divBegin:(divBegin + 1000)]
    soup = BeautifulSoup(article)
    # The date and info lists both collect every <span>, and the name and link
    # lists both collect every <a>, so each pair holds the same tags.
    nameList = soup.findAll('a')
    startDateList = soup.findAll('span')
    locationList = soup.findAll('td')
    infoList = soup.findAll('span')
    linkList = soup.findAll('a')
    for tag in nameList:
        print tag
        print "\n"
    print "<!------------End Names-------------->"
    for tag in startDateList:
        print tag
        print "\n"
    print "<!------------End StartDate-------------->"
    for tag in locationList:
        print tag
        print "\n"
    print "<!------------End Locations-------------->"
    for tag in infoList:
        print tag
        print "\n"
    print "<!------------End Info-------------->"
    for tag in linkList:
        print tag
        print "\n"
    print "<!------------End Document-------------->"
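
To make the two regex questions concrete, here is a rough sketch of the kind of thing I think I'm after. It is only an illustration under assumptions, not a tested solution for the real page. The markup in sample_html, the example.com URLs, and the names pat_row, pat_link, pat_span and events are all made up. I'm also assuming that each event sits in its own <tr>, with the date in the first <span> and the description in the second.

```python
import re

# Hypothetical markup: the URLs and layout are placeholders for illustration,
# not the real events page.
sample_html = """
<tr><td><A HREF="javascript:popUp('https://example.com/event?id=47','580','580')">Spring Concert</A></td>
<td><span>May 20, 2012</span></td><td><span>Music on the quad</span></td></tr>
<tr><td><A HREF="javascript:popUp('https://example.com/event?id=48','580','580')">Guest Lecture</A></td>
<td><span>May 21, 2012</span></td><td><span>Talk in the library</span></td></tr>
"""

# IGNORECASE matches <A HREF=...> as well as <a href=...>;
# DOTALL lets "." cross the line breaks inside a row.
flags = re.IGNORECASE | re.DOTALL

# Non-greedy (.*?) stops at the first closing tag instead of the last one.
pat_row = re.compile(r'<tr>(.*?)</tr>', flags)
# [^']+ captures the first quoted popUp argument, which is the URL here.
pat_link = re.compile(r"""<a\s+href="javascript:popUp\('([^']+)'[^>]*>(.*?)</a>""", flags)
pat_span = re.compile(r'<span>(.*?)</span>', flags)

events = []
for row in pat_row.findall(sample_html):
    link = pat_link.search(row)
    spans = pat_span.findall(row)
    # Skip rows that do not look like an event row.
    if link is None or len(spans) < 2:
        continue
    events.append({
        'link': link.group(1),
        'name': link.group(2),
        'start_date': spans[0],  # assumption: first <span> is the date
        'info': spans[1],        # assumption: second <span> is the description
    })

# Single-argument print() calls run on both Python 2 and Python 3.
for event in events:
    print(event['name'] + ' | ' + event['start_date'] + ' | ' + event['info'])
    print(event['link'])
```

The idea is that searching inside one row at a time keeps the date and description spans from being mixed up across events, even though they share the same tag. Whether the real page is laid out this way is something I have not verified.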