Beautiful Soup table parsing

Time: 2015-03-02 11:23:22

Tags: python amazon-ec2 beautifulsoup

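We are working on a university project in which we want to extract data from the university timetable and use it in our own project. We have a Python script that extracts the data. It runs fine on a local machine, but when we run the same script on Amazon EC2 it fails with an error.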

from bs4 import BeautifulSoup
import requests

# url from timetable.ucc.ie showing 3rd Year semester 1 timetable
url = 'http://timetable.ucc.ie/showtimetable2.asp?filter=%28None%29&identifier=BSCS3&days=1-5&periods=1-20&weeks=5-16&objectclass=programme%2Bof%2Bstudy&style=individual'

# Retrieve the web page at url and convert the data into a soup object
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)

# Retrieve the table containing the timetable from the soup object for parsing
timetable_to_parse = soup.find('table', {'class' : 'grid-border-args'})

i = 0  # i is an index into pre_format_day
pre_format_day = [[],[],[],[],[],[]]  # holds un-formatted day information
day = [[],[],[],[],[],[]]  # hold formatted day information
day[0] = pre_format_day[0]

# look at each td within the table
for slot in timetable_to_parse.findAll('td'):
    # if slot content is a day of the week, move pointer to next day
    # indicated all td's relating to a day have been looked at
    if slot.get_text() in ( 'Mon', 'Tue' , 'Wed' , 'Thu' , 'Fri'):
        i += 1
    else:  # otherwise the td related to a time slot in a day
        try:
            if slot['colspan'] == "4":  # test if colspan of td is 4 (compare by value, not identity)
                # if it is, append to list twice to represent 2 hours
                pre_format_day[i].append(slot.get_text().replace('\n',''))
                pre_format_day[i].append(slot.get_text().replace('\n',''))
        except KeyError:  # td has no colspan attribute
            pass
        # if length of text of td is 1, > 11 or contains ":00"
        if len(slot.get_text()) == 1 or len(slot.get_text()) > 11 or ":00" in\
                slot.get_text():
            # add to pre_format_day
            pre_format_day[i].append(slot.get_text().replace('\n',''))

# go through each day in pre_format_day and insert formatted version in day[]
for i in range(1,6):
    j = 0
    while j < 20:
        if len(pre_format_day[i][j]) > 10:  # if there is an event store in day
            day[i].append(pre_format_day[i][j])
        else:  # insert space holder into slots with no events
            day[i].append('----- ')
        j += 2

# creates a string containing a html table for output
timetable = '<table><tr>'
timetable += '<th></th>'
for i in range(0, 10):
    timetable += '<th>' + day[0][i] + '</th> '

days = ['', 'Mon', 'Tue' , 'Wed' , 'Thu' , 'Fri']

for i in range(1,6):
    timetable += '</tr><tr><th>' + days[i] + '</th>'
    for j in range(0,10):
        if len(day[i][j]) > 10:
            timetable += '<td class="lecture">' + day[i][j] + '</td>'
        else:
            timetable += '<td></td>'

timetable += '</tr></table>'

# output timetable string
print timetable

The output on the local machine is a table containing the required data.

The output on the EC2 instance is:

Traceback (most recent call last):
  File "parse2.py", line 21, in <module>
    for slot in timetable_to_parse.findAll('td'):
AttributeError: 'NoneType' object has no attribute 'findAll'

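Both machines are running Ubuntu 14.10 and Python 2.7. For some reason I can't work out, the script on EC2 does not seem to fetch the required page from the URL and extract the table from it, and beyond that I'm lost.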

Any help is much appreciated.

2 Answers:

Answer 0: (score: 2)

The problem was that the EC2 instance was using a different parser from the local machine. Fixed by installing lxml:

apt-get install python-lxml
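
Some background on why this probably helps: when BeautifulSoup is called without a second argument, it picks the best parser installed on the machine (lxml first, then html5lib, then Python's built-in html.parser), so two machines with different packages can build different trees from the same page. Below is a minimal sketch of naming the parser explicitly and failing with a clearer message when the table is missing. The SAMPLE_HTML string and the find_timetable helper are made up for illustration, since the real page's HTML is not shown in this thread. Whether html.parser copes with the real timetable page is an assumption; passing "lxml" instead should match the setup that works after the fix above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the timetable page (the real HTML is not shown here).
SAMPLE_HTML = """
<html><body>
<table class="grid-border-args">
  <tr><td>Mon</td><td colspan="4">CS3301 Lecture</td></tr>
</table>
</body></html>
"""


def find_timetable(html, parser="html.parser"):
    """Parse html with an explicitly named parser and return the timetable table."""
    soup = BeautifulSoup(html, parser)
    table = soup.find("table", {"class": "grid-border-args"})
    if table is None:
        # Fail here with a readable message instead of a later AttributeError.
        raise ValueError("timetable table not found using parser %r" % parser)
    return table


table = find_timetable(SAMPLE_HTML)
cells = [td.get_text() for td in table.find_all("td")]
print(cells)
```

In the question's script this would amount to replacing BeautifulSoup(data) with BeautifulSoup(data, "lxml") once python-lxml is installed.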

Answer 1: (score: 0)

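Log in to the EC2 instance and step through the script line by line in the Python CLI until you find the problem. For some reason, BeautifulSoup parsing works slightly differently on different systems. I have run into the same problem and don't know the reason behind it. Without knowing the HTML content, it is hard for us to give you specific help.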
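
One way to narrow this down while stepping through on each machine is to parse the same HTML with every parser BeautifulSoup knows about and see which ones find the table. The sketch below assumes the table class from the question. The compare_parsers helper and the SAMPLE_HTML string are hypothetical; on the real machines you would pass r.text from the request instead.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def compare_parsers(html, parsers=("lxml", "html5lib", "html.parser")):
    """Report, per parser, whether the timetable table can be found in html."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # Raised by bs4 when the requested parser is not installed.
            results[name] = "not installed"
            continue
        table = soup.find("table", {"class": "grid-border-args"})
        results[name] = "table found" if table is not None else "table missing"
    return results


# Hypothetical stand-in for the page content.
SAMPLE_HTML = '<table class="grid-border-args"><tr><td>Mon</td></tr></table>'

report = compare_parsers(SAMPLE_HTML)
for name in sorted(report):
    print("%s: %s" % (name, report[name]))
```

If the two machines print different reports for the same HTML, the difference is in the installed parsers rather than in the page that was downloaded.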