I have the following script that scrapes data from my uni website and inserts it into the GAE datastore:
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import re
import datetime
__author__ = "Nash Rafeeq"
url = "http://webspace.apiit.edu.my/schedule/timetable.jsp"
viewurl = "http://localhost:8000/timekeeper/intake/checkintake/"
inserturl = "http://localhost:8000/timekeeper/intake/addintake/"
print url
mech = Browser()
try:
    page = mech.open(url)
    html = page.read()
except Exception, err:
    print str(err)
#print html
soup = BeautifulSoup(html)
soup.prettify()
tables = soup.find('select')
for options in tables:
    intake = options.string
    #print intake
    try:
        #print viewurl+intake
        page = mech.open(viewurl+intake)
        html = page.read()
        print html
        if html=="Exist in database":
            print intake, " Exist in the database skiping"
        else:
            page = mech.open(inserturl+intake)
            html = page.read()
            print html
            if html=="Ok":
                print intake, "added to the database"
            else:
                print "Error adding ", intake, " to database"
    except Exception, err:
        print str(err)
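I would like to know the best way to optimize this script so that I can run it on the App Engine servers. As it stands, it scrapes over 300 entries and takes more than 10 minutes to insert all the data on my local machine.
The model used to store the data is: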
class Intake(db.Model):
    intake=db.StringProperty(multiline=False, required=True)
    #@permerlink
    def get_absolute_url(self):
        return "/timekeeper/%s/" % self.intake
    class Meta:
        db_table = "Intake"
        verbose_name_plural = "Intakes"
        ordering = ['intake']
Answer 0 (score: 4)
Answer 1 (score: 2)
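The first thing you should do is rewrite your script to use the App Engine datastore directly. Most of the time you are spending is undoubtedly due to using HTTP requests (two per entry!) to insert data into the datastore. Using the datastore directly with batch puts should cut your runtime by a couple of orders of magnitude.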
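As a rough illustration of the batch-put idea, here is a minimal sketch. The helper names (`chunked`, `save_new_intakes`), the `put_batch` callable and the batch size of 100 are assumptions made for this example, not part of the original script. On App Engine, `put_batch` would typically wrap a single `db.put(list_of_entities)` call; it is passed in as a parameter here so the sketch runs without the SDK.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def save_new_intakes(scraped_codes, existing_codes, put_batch, batch_size=100):
    """Store only the intake codes that are not already known, in batches.

    `put_batch` is a hypothetical callable that receives a list of new codes.
    On App Engine it could build Intake entities from the codes and hand the
    whole list to db.put() in one call instead of one request per entry.
    """
    seen = set(existing_codes)
    new_codes = []
    for code in scraped_codes:
        code = (code or "").strip()   # drop None and whitespace-only entries
        if code and code not in seen:
            seen.add(code)            # also removes duplicates within one scrape
            new_codes.append(code)
    for batch in chunked(new_codes, batch_size):
        put_batch(batch)              # one write call per batch
    return new_codes
```

With this shape, the existing codes are loaded once up front and compared in memory, so the number of datastore round trips depends on the number of batches rather than on the number of scraped entries.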
If your parsing code is still too slow, you can split the work into chunks and use the task queue API to do the work across multiple requests.
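One possible way to split the work for the task queue is sketched below. The function names, the chunk size of 50, the comma-joined `codes` parameter and the handler URL mentioned in the comment are all hypothetical choices for this example. It also assumes intake codes never contain a comma. The `enqueue` callable stands in for something like `taskqueue.add(url=..., params=...)`, so the sketch runs without the SDK.

```python
def plan_tasks(intake_codes, chunk_size=50):
    """Split the scraped codes into one parameter dict per task."""
    payloads = []
    for start in range(0, len(intake_codes), chunk_size):
        chunk = intake_codes[start:start + chunk_size]
        # Assumption: codes contain no commas, so they can be joined safely.
        payloads.append({"codes": ",".join(chunk)})
    return payloads


def enqueue_all(payloads, enqueue):
    """Hand every payload to `enqueue` and return how many tasks were queued.

    On App Engine, `enqueue` might be something like
        lambda params: taskqueue.add(url="/timekeeper/intake/import/", params=params)
    where the URL is a hypothetical handler that stores one chunk per request.
    """
    for params in payloads:
        enqueue(params)
    return len(payloads)
```

Each task then handles a small slice of the entries, so no single request has to process all 300+ of them.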
Answer 2 (score: 1)
Hi, following the suggestions from tosh and nick, I have modified the script as below:
from google.appengine.api import urlfetch
from BeautifulSoup import BeautifulSoup
from timkeeper.models import Intake
from google.appengine.ext import db
__author__ = "Nash Rafeeq"
url = "http://webspace.apiit.edu.my/schedule/timetable.jsp"
try:
    page = urlfetch.fetch(url)
    #print html
    soup = BeautifulSoup(page.content)
    soup.prettify()
    tables = soup.find('select')
    models=[]
    for options in tables:
        intake_code = options.string
        if Intake.all().filter('intake',intake_code).count()<1:
            data = Intake(intake=intake_code)
            models.append(data)
    try:
        if len(models)>0:
            db.put(models)
        else:
            pass
    except Exception,err:
        pass
except Exception, err:
    print str(err)
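Am I on the right track? I am also not sure how to have this invoked on a schedule (once a week). What would be the best way to do that?
And thanks for the prompt answers.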