这是一个后续问题。下面是我的Python脚本,它读取不断增长的日志文件(文本)并将数据插入Postgresql DB。每天生成新的日志文件。我所做的是我提交每条线路,这会产生巨大的负载和非常差的性能(需要4个小时才能插入30分钟的文件数据!)。如何改进此代码以插入大量的行?这会有助于提高性能并减少负载吗?我读过有关copy_from的内容,但在这种情况下无法弄清楚如何使用它。
import psycopg2 as psycopg
try:
connectStr = "dbname='postgis20' user='postgres' password='' host='localhost'"
cx = psycopg.connect(connectStr)
cu = cx.cursor()
logging.info("connected to DB")
except:
logging.error("could not connect to the database")
import time
file = open('textfile.log', 'r')
while 1:
where = file.tell()
line = file.readline()
if not line:
time.sleep(1)
file.seek(where)
else:
print line, # already has newline
dodecode(line)
------------
def dodecode(fields):
global cx
from time import strftime, gmtime
from calendar import timegm
import os
msg = fields.split(',')
part = eval(msg[2])
msgnum = int(msg[3:6])
print "message#:", msgnum
print fields
if (part==1):
if msgnum==1:
msg1 = msg_1.decode(bv)
#print "message1 :",msg1
Insert(msgnum,time,msg1)
elif msgnum==2:
msg2 = msg_2.decode(bv)
#print "message2 :",msg2
Insert(msgnum,time,msg2)
elif msgnum==3:
....
....
....
----------------
def Insert(msgnum,time,msg):
global cx
try:
if msgnum in [1,2,3]:
if msg['type']==0:
cu.execute("INSERT INTO table1 ( messageid, timestamp, userid, position, text ) SELECT "+str(msgnum)+", '"+time+"', "+str(msg['UserID'])+", ST_GeomFromText('POINT("+str(float(msg['longitude']), '"+text+"')+" "+str(float(msg['latitude']))+")']))+" WHERE NOT EXISTS (SELECT * FROM table1 WHERE timestamp='"+time+"' AND text='"+text+"';")
cu.execute("INSERT INTO table2 ( field1,field2,field3, time_stamp, pos,) SELECT "+str(msg['UserID'])+","+str(int(msg['UserName']))+","+str(int(msg['UserIO']))+", '"+time+"', ST_GeomFromText('POINT("+str(float(msg['longitude']))+" "+str(float(msg['latitude']))+")')," WHERE NOT EXISTS (SELECT * FROM table2 WHERE field1="+str(msg['UserID'])+");")
cu.execute("Update table2 SET field3='"+str(int(msg['UserIO']))+"',time_stamp='"+str(time)+"',pos=ST_GeomFromText('POINT("+str(float(msg['longitude']))+" "+str(float(msg['latitude']))+")'),"' WHERE field1='"+str(msg['UserID'])+"' AND time_stamp < '"+str(time)+"';")
elif msg['type']==1:
cu.execute("INSERT INTO table1 ( messageid, timestamp, userid, position, text ) SELECT "+str(msgnum)+", '"+time+"', "+str(msg['UserID'])+", ST_GeomFromText('POINT("+str(float(msg['longitude']), '"+text+"')+" "+str(float(msg['latitude']))+")']))+" WHERE NOT EXISTS (SELECT * FROM table1 WHERE timestamp='"+time+"' AND text='"+text+"';")
cu.execute("INSERT INTO table2 ( field1,field2,field3, time_stamp, pos,) SELECT "+str(msg['UserID'])+","+str(int(msg['UserName']))+","+str(int(msg['UserIO']))+", '"+time+"', ST_GeomFromText('POINT("+str(float(msg['longitude']))+" "+str(float(msg['latitude']))+")')," WHERE NOT EXISTS (SELECT * FROM table2 WHERE field1="+str(msg['UserID'])+");")
cu.execute("Update table2 SET field3='"+str(int(msg['UserIO']))+"',time_stamp='"+str(time)+"',pos=ST_GeomFromText('POINT("+str(float(msg['longitude']))+" "+str(float(msg['latitude']))+")'),"' WHERE field1='"+str(msg['UserID'])+"' AND time_stamp < '"+str(time)+"';")
elif msg['type']==2:
....
....
....
except Exception, err:
#print('ERROR: %s\n' % str(err))
logging.error('ERROR: %s\n' % str(err))
cx.commit()
cx.commit()
答案 0 :(得分:1)
每个事务执行多行,每个查询将使其更快,
当遇到类似的问题时,我在插入查询的值部分放了多行, 但是你有复杂的插入查询,所以你可能需要一种不同的方法。
我建议创建一个临时表,并使用普通的多行插入插入10000行:
insert into temptable values ( /* row1 data */ ) ,( /* row2 data */ ) etc...
每个插入500行。这是一个很好的起点。
然后将临时表与现有数据连接起来以进行重复数据删除。
delete from temptable using livetable where /* .join condition */ ;
如果需要的话,并对其进行重复删除
delete from temptable where id not in
( select distinct on ( /* unique columns */) id from temptable);
然后使用insert-select将临时表中的行复制到实时表
insert into livetable ( /* columns */ )
select /* columns */ from temptable;
看起来您可能还需要update-from
最后放下临时表并重新开始。
你正在写两张桌子;你需要加倍所有这些操作。
我通过维护一个count和一个值列表来插入然后在插入时进行插入
根据需要多次重复查询(%s,%s,%s,%s)
部分,并单独传递值列表,让psycopg2处理格式化。
我希望通过这些更改可以让你加速5次以获得更多