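I'm trying to write a data-scraping script for a class. The data I have to scrape requires me to use while loops to put the right values into separate arrays, i.e. one for states, one for SAT averages, and so on.
However, once I set up the while loops, the regex that had been stripping most of the HTML tags from the data stopped working, and I now get this error:
AttributeError: 'NoneType' object has no attribute 'groups'
My code is: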
import re, util
from BeautifulSoup import BeautifulStoneSoup
# create a comma-delimited file
delim = ", "
#base url for sat data
base = "http://www.usatoday.com/news/education/2007-08-28-sat-table_N.htm"
#get webpage object for site
soup = util.mysoupopen(base)
#get column headings
colCols = soup.findAll("td", {"class":"vaTextBold"})
#get data
dataCols = soup.findAll("td", {"class":"vaText"})
#append data to cols
for i in range(len(dataCols)):
    colCols.append(dataCols[i])
#open a csv file to write the data to
fob=open("sat.csv", 'a')
#initiate the 5 arrays
states = []
participate = []
math = []
read = []
write = []
#split into 5 lists for each row
for i in range(len(colCols)):
    if i%5 == 0:
        states.append(colCols[i])
i=1
while i<=250:
    participate.append(colCols[i])
    i = i+5
i=2
while i<=250:
    math.append(colCols[i])
    i = i+5
i=3
while i<=250:
    read.append(colCols[i])
    i = i+5
i=4
while i<=250:
    write.append(colCols[i])
    i = i+5
#write data to the file
for i in range(len(states)):
    states = str(states[i])
    participate = str(participate[i])
    math = str(math[i])
    read = str(read[i])
    write = str(write[i])
    #regex to remove html from data scraped
    #remove <td> tags
    line = re.search(">(.*)<", states).groups()[0] + delim + re.search(">(.*)<", participate).groups()[0]+ delim + re.search(">(.*)<", math).groups()[0] + delim + re.search(">(.*)<", read).groups()[0] + delim + re.search(">(.*)<", write).groups()[0]
    #append data point to the file
    fob.write(line)
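Any ideas as to why this error suddenly appeared? The regex was working fine until I tried to split the data into different lists. I have already tried printing the various strings inside the final "for" loop to see whether any of them were "None" for the first i value (0), but they were all the strings they were supposed to be.
Any help would be greatly appreciated!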
Answer 0 (score: 1)
It looks like the regex search is failing on (one of) the strings, so it returns None instead of a MatchObject.
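To see that behaviour in isolation, here is a minimal sketch. The sample strings are made up for illustration and are not taken from the scraped page. `re.search` returns `None` when the pattern finds nothing, and calling `.groups()` on `None` raises the error from the question.

```python
import re

# Hypothetical sample strings, not taken from the real page.
good = '<td class="vaText">Alabama</td>'
bad = "t"  # contains neither '>' nor '<', so the pattern cannot match

match_good = re.search(">(.*)<", good)
match_bad = re.search(">(.*)<", bad)

print(match_good.groups()[0])  # the text between the tags
print(match_bad)               # None, because nothing matched

try:
    match_bad.groups()
except AttributeError as err:
    # Same kind of failure as in the question.
    error_message = str(err)
    print(error_message)
```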
Try the following instead of the very long line under the #remove <td> tags comment:
import sys

out_list = []
for item in (states, participate, math, read, write):
    try:
        out_list.append(re.search(">(.*)<", item).groups()[0])
    except AttributeError:
        # re.search returned None, so show the string that did not match
        print "Regex match failed on", item
        sys.exit()
line = delim.join(out_list)
This way, you can find out where your regex is failing.
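One thing such a check may well turn up, assuming the `str(...)` assignments sit inside the final `for` loop: `states = str(states[i])` rebinds the name `states` from the list to a single string. The first pass (i = 0) still works, but on the second pass `states[1]` is just one character of that string, so the regex has nothing to match. The sketch below uses made-up cells, and the names `rebound`, `cells` and `text` are hypothetical.

```python
import re

# Made-up cells standing in for the scraped <td> elements.
states = ['<td class="vaTextBold">Alabama</td>',
          '<td class="vaTextBold">Alaska</td>']

first = str(states[0])    # what the loop sees when i == 0
rebound = first           # after `states = str(states[0])` the list is gone
second = str(rebound[1])  # when i == 1: a single character of the string

print(re.search(">(.*)<", first).group(1))
print(re.search(">(.*)<", second))  # None, hence the AttributeError

# Using a separate name for the string keeps `states` a list.
cells = []
for i in range(len(states)):
    text = str(states[i])
    cells.append(re.search(">(.*)<", text).group(1))
print(cells)
```

If this is indeed the cause, the same would apply to `participate`, `math`, `read` and `write`.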
Also, I recommend that you use .group(1) instead of .groups()[0]. The former is more explicit.
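A small sketch of the difference, using a made-up cell: both calls give the same text here, but `.group(1)` asks for the first capture group directly, while `.groups()` builds a tuple of all capture groups that then has to be indexed.

```python
import re

# Hypothetical cell, for illustration only.
m = re.search(">(.*)<", '<td class="vaText">515</td>')

via_groups = m.groups()[0]  # tuple of all capture groups, then the first item
via_group = m.group(1)      # capture group 1, requested by number
whole_match = m.group(0)    # group 0 is the entire matched text

print(via_groups)
print(via_group)
print(whole_match)
```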