Parsing an HTML table

Time: 2013-12-23 00:10:01

Tags: python html-parsing beautifulsoup

I have an HTML table that I need to parse into a CSV file.

import urllib2, datetime
olddate = datetime.datetime.strptime('5/01/13', "%m/%d/%y")
from BeautifulSoup import BeautifulSoup
print("dates,location,name,url")
def genqry(arga,argb,argc,argd):
    return arga + "," + argb + "," + argc + "," + argd
part = 1
row = 1
contenturl = "http://www.robotevents.com/robot-competitions/vex-robotics-competition"
soup = BeautifulSoup(urllib2.urlopen(contenturl).read())
table = soup.find('table', attrs={'class': 'catalog-listing'})
rows = table.findAll('tr')
for tr in rows:
    try:
        if row != 1:
            cols = tr.findAll('td')
            for td in cols:
                if part == 1:
                    keep = 0
                    dates = td.find(text=True)
                    part = 2
                if part == 2:
                    location = td.find(text=True)
                    part = 2
                if part == 3:
                    name = td.find(text=True)
                    for a in tr.findAll('a', href=True):
                        url = a['href']
                # Compare Dates
                if len(dates) < 6:
                    newdate = datetime.datetime.strptime(dates, "%m/%d/%y")
                    if newdate > olddate:
                        keep = 1
                    else:
                        keep = 0
                else:
                    newdate = datetime.datetime.strptime(dates[:6], "%m/%d/%y")
                    if newdate > olddate:
                        keep = 1
                    else:
                        keep = 0
                if keep == 1:
                    qry = genqry(dates, location, name, url)
                    print(qry)
                row = row + 1
                part = 1
        else:
            row = row + 1
    except (RuntimeError, TypeError, NameError):
        print("Error: " + name)

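I need to get every VEX event in that table that takes place after May 1, 2013. So far, this code gives me an error about the dates that I can't seem to fix. Maybe someone better at this than me can fix the code? Thanks in advance, Smith.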

Edit #1: The error I get is:

Value Error: '\n10/5/13' does not match format '%m/%d/%y'

I think I need to strip the newline from the start of the string first.

Edit #2: I got it to run, but it produces no output. Any help?

1 Answer:

Answer 0: (score: 0)

Your question is poorly stated. Without knowing the exact error, my guess is that the problem lies in your if len(dates) < 6: block. Consider the following:

>>> date = '10/5/13 - 12/14/13'
>>> len(date)
18
>>> date = '11/9/13'
>>> len(date)
7
>>> date[:6]
'11/9/1'
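
As the session above suggests, neither the length check nor the dates[:6] slice isolates a usable date. One possible alternative is to split on the range separator instead of slicing by length. This is only a minimal sketch: parse_start_date is a hypothetical helper name, and it assumes each cell holds either a single date or a range written as "start - end".

```python
import datetime

def parse_start_date(text):
    """Return the first date in a cell such as '10/5/13 - 12/14/13'.

    Hypothetical helper: assumes the start date comes first and that
    a hyphen only appears as the range separator.
    """
    # Keep everything before the first hyphen, then trim whitespace.
    first = text.split("-")[0].strip()
    return datetime.datetime.strptime(first, "%m/%d/%y")

print(parse_start_date("10/5/13 - 12/14/13"))
print(parse_start_date("11/9/13"))
```

Because the string is split rather than sliced, it no longer matters whether the month and day are one or two digits long.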

One suggestion to make your code more Pythonic: use enumerate instead of row = row + 1.
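
A rough illustration of the enumerate suggestion. The rows list here is a stand-in for the result of table.findAll('tr'), so that the snippet runs without fetching the page; the strings in it are made up for the example.

```python
# Stand-in for table.findAll('tr'); the first entry plays the header row.
rows = ["header", "first event", "second event"]

data_rows = []
for index, tr in enumerate(rows):
    if index == 0:
        # Skip the header row without keeping a manual counter.
        continue
    data_rows.append((index, tr))

print(data_rows)
```

The counter is produced by the loop itself, so it cannot drift out of step with the rows the way a hand-incremented variable can.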

Update: Tracing through your code, I get the following value for dates:

>>> dates
u'\n10/5/13 - 12/14/13            \xa0\n        '
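
Given a value like that, the surrounding whitespace (including the non-breaking space \xa0) has to be removed before strptime can match the format. A minimal sketch, written for Python 3 and assuming the start date is always the first whitespace-separated token in the cell; start_date and CUTOFF are hypothetical names, not part of the original code.

```python
import datetime

# Cutoff taken from the question: events after May 1, 2013.
CUTOFF = datetime.datetime.strptime("5/01/13", "%m/%d/%y")

# The value observed while tracing the question's code.
raw = u"\n10/5/13 - 12/14/13            \xa0\n        "

def start_date(cell_text):
    # split() with no arguments splits on any run of whitespace,
    # which covers the newline and the non-breaking space here.
    tokens = cell_text.split()
    return datetime.datetime.strptime(tokens[0], "%m/%d/%y")

keep = start_date(raw) > CUTOFF
print(start_date(raw), keep)
```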