Python script optimization for App Engine

Date: 2009-11-27 15:20:39

Tags: python google-app-engine

I have the following script, which scrapes data from my uni website and inserts it into the GAE datastore:

from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import re
import datetime

__author__ = "Nash Rafeeq" 

url  = "http://webspace.apiit.edu.my/schedule/timetable.jsp"
viewurl  = "http://localhost:8000/timekeeper/intake/checkintake/"
inserturl = "http://localhost:8000/timekeeper/intake/addintake/"
print url
mech =  Browser()
try:
    page = mech.open(url)
    html = page.read()
except Exception, err:
    print str(err)
#print html 
soup = BeautifulSoup(html)
soup.prettify() 
tables  = soup.find('select')
for options in tables:
    intake = options.string
    #print intake
    try:
        #print viewurl+intake
        page = mech.open(viewurl+intake)
        html = page.read()
        print html
        if html=="Exist in database":
            print intake, " Exist in the database skipping"
        else:
            page = mech.open(inserturl+intake)
            html = page.read()
            print html
            if html=="Ok":
                print intake, "added to the database"
            else:
                print "Error adding ",  intake, " to database"
    except Exception, err:
        print str(err)

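I would like to know the best way to optimize this script so that I can run it on the App Engine servers. As it stands, it scrapes more than 300 entries and takes over 10 minutes to insert all the data on my local machine.

The model used to store the data is: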

class Intake(db.Model):
    intake=db.StringProperty(multiline=False, required=True)
    #@permerlink    
    def get_absolute_url(self):
        return "/timekeeper/%s/" % self.intake
    class Meta:
        db_table = "Intake"
        verbose_name_plural = "Intakes"
        ordering = ['intake']

3 Answers:

Answer 0 (score: 4)

Divide and conquer

  1. Make a list of tasks (e.g. the URLs to scrape/parse)
  2. Add your tasks to a queue (App Engine Task Queue API, Amazon SQS, ...)
  3. Process your queue
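
The three steps above could be sketched in plain Python as follows. This is only an illustration: it uses an in-process `collections.deque` as a stand-in for a real task queue (on App Engine that role would be played by the Task Queue API), and `build_tasks`, `enqueue_all`, `process_queue`, `process_task` and the sample intake codes are hypothetical names that do not appear in the original script.

```python
from collections import deque


def build_tasks(intake_codes):
    """Step 1: turn the scraped intake codes into a list of small tasks."""
    return [{"intake": code} for code in intake_codes]


def enqueue_all(tasks):
    """Step 2: put every task on a queue (stand-in for a real task queue)."""
    pending = deque()
    for task in tasks:
        pending.append(task)
    return pending


def process_queue(pending, handler):
    """Step 3: drain the queue in FIFO order, one task at a time."""
    results = []
    while pending:
        task = pending.popleft()
        results.append(handler(task))
    return results


def process_task(task):
    # Hypothetical handler: a real one would store the intake.
    return "stored " + task["intake"]


# Made-up intake codes, for illustration only.
tasks = build_tasks(["INTAKE-A", "INTAKE-B", "INTAKE-C"])
results = process_queue(enqueue_all(tasks), process_task)
```

Because each task is small and independent, a failed task can be retried on its own without repeating the whole scrape.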

Answer 1 (score: 2)

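The first thing you should do is rewrite your script to use the App Engine datastore directly. A large part of the time you are spending is undoubtedly because you are using HTTP requests (two per entry!) to insert data into the datastore. Using the datastore directly with batch puts should cut a couple of orders of magnitude off your runtime.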
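
A minimal sketch of that idea, kept free of App Engine imports so it runs anywhere: it filters out the intake codes that are already stored using a single set lookup and then writes the remainder in one batched call. `find_new_codes` and `store_new_codes` are hypothetical helpers, and `put_batch` is a stand-in callable; on App Engine it could be `db.put`, which accepts a list of model instances, and `existing_codes` would come from one datastore query instead of one HTTP round trip per entry.

```python
def find_new_codes(scraped_codes, existing_codes):
    """Return scraped codes not yet stored, in order, without duplicates."""
    seen = set(existing_codes)
    new_codes = []
    for code in scraped_codes:
        if code not in seen:
            seen.add(code)
            new_codes.append(code)
    return new_codes


def store_new_codes(scraped_codes, existing_codes, put_batch):
    """Write all new codes with a single put_batch call (none if empty)."""
    new_codes = find_new_codes(scraped_codes, existing_codes)
    if new_codes:
        put_batch(new_codes)
    return new_codes


# Illustration with an in-memory list standing in for the datastore.
stored = ["B2"]
added = store_new_codes(["A1", "B2", "A1", "C3"], stored, stored.extend)
```

With this shape the number of write calls no longer grows with the number of entries, which is where the original script spends most of its time.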

If your parsing code is still too slow, you can split the work into chunks and use the task queue API to do the work in multiple requests.

Answer 2 (score: 1)
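
Splitting the work into chunks might look like the sketch below. `chunked` and `enqueue_chunks` are hypothetical helpers, the chunk size of 50 is an arbitrary assumption, and `enqueue` stands in for whatever call adds a task to the queue; each payload would then be handled in its own request.

```python
def chunked(items, size):
    """Yield consecutive slices of items, each at most size long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def enqueue_chunks(items, enqueue, size=50):
    """Hand each chunk to enqueue and return how many tasks were created."""
    count = 0
    for chunk in chunked(items, size):
        # Comma-separated payload; assumes the codes contain no commas.
        enqueue(",".join(chunk))
        count += 1
    return count


# Illustration: 120 made-up codes become 3 payloads (50 + 50 + 20).
codes = ["code%03d" % i for i in range(120)]
payloads = []
task_count = enqueue_chunks(codes, payloads.append, size=50)
```

Smaller chunks keep each request short, at the cost of more tasks to schedule.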

Hi, following the suggestions from tosh and nick, I have modified the script as below:

from google.appengine.api import urlfetch
from BeautifulSoup import BeautifulSoup
from timkeeper.models import Intake
from google.appengine.ext import db

__author__ = "Nash Rafeeq" 

url  = "http://webspace.apiit.edu.my/schedule/timetable.jsp"
try:
    page = urlfetch.fetch(url)
    #print html 
    soup = BeautifulSoup(page.content)
    soup.prettify() 
    tables  = soup.find('select')
    models=[]
    for options in tables:
        intake_code = options.string
        if Intake.all().filter('intake',intake_code).count()<1:
            data = Intake(intake=intake_code)
            models.append(data)
    try:
        if len(models)>0:
            db.put(models)
        else:
            pass 
    except Exception,err:
        pass
except Exception, err:
    print str(err)

Am I on the right track? Also, I am not sure what the best way is to invoke this on a schedule (once a week).

And thanks for the prompt answers.