Bulk loading data from a log file into PostgreSQL using Python

Date: 2014-12-30 05:47:04

Tags: python postgresql

This is a follow-up question. Below is my Python script, which reads a continuously growing log file (text) and inserts the data into a PostgreSQL DB. A new log file is generated every day. What I do is commit every single line, which creates a huge load and very poor performance (it takes 4 hours to insert 30 minutes' worth of file data!). How can I improve this code to insert many rows at a time? Would that improve performance and reduce the load? I've read about copy_from but couldn't figure out how to use it in this scenario.

    import logging
    import psycopg2 as psycopg

    try:
        connectStr = "dbname='postgis20' user='postgres' password='' host='localhost'"
        cx = psycopg.connect(connectStr)
        cu = cx.cursor()
        logging.info("connected to DB")
    except:
        logging.error("could not connect to the database")


    import time
    file = open('textfile.log', 'r')
    while 1:
        where = file.tell()
        line = file.readline()
        if not line:
            time.sleep(1)
            file.seek(where)
        else:
            print line,  # already has newline
            dodecode(line)
            ------------
    def dodecode(fields):
        global cx
        from time import strftime, gmtime
        from calendar import timegm
        import os
        msg = fields.split(',')
        part = eval(msg[2])
        msgnum = int(msg[3:6])
        print "message#:", msgnum
        print fields

        if (part==1):
            if msgnum==1:
                msg1 = msg_1.decode(bv)
                #print "message1 :",msg1
                Insert(msgnum,time,msg1)
            elif msgnum==2:
                msg2 = msg_2.decode(bv)
                #print "message2 :",msg2
                Insert(msgnum,time,msg2)
            elif msgnum==3:
            ....
            ....
            ....
        ----------------
    def Insert(msgnum,time,msg):
        global cx

        try:
            if msgnum in [1,2,3]:
                if msg['type']==0:
                    cu.execute("INSERT INTO table1 ( messageid, timestamp, userid, position, text ) SELECT "+str(msgnum)+", '"+time+"', "+str(msg['UserID'])+", ST_GeomFromText('POINT("+str(float(msg['longitude']))+" "+str(float(msg['latitude']))+")'), '"+text+"' WHERE NOT EXISTS (SELECT * FROM table1 WHERE timestamp='"+time+"' AND text='"+text+"');")
                    cu.execute("INSERT INTO table2 ( field1, field2, field3, time_stamp, pos ) SELECT "+str(msg['UserID'])+", "+str(int(msg['UserName']))+", "+str(int(msg['UserIO']))+", '"+time+"', ST_GeomFromText('POINT("+str(float(msg['longitude']))+" "+str(float(msg['latitude']))+")') WHERE NOT EXISTS (SELECT * FROM table2 WHERE field1="+str(msg['UserID'])+");")
                    cu.execute("UPDATE table2 SET field3='"+str(int(msg['UserIO']))+"', time_stamp='"+str(time)+"', pos=ST_GeomFromText('POINT("+str(float(msg['longitude']))+" "+str(float(msg['latitude']))+")') WHERE field1='"+str(msg['UserID'])+"' AND time_stamp < '"+str(time)+"';")
                elif msg['type']==1:
                    cu.execute("INSERT INTO table1 ( messageid, timestamp, userid, position, text ) SELECT "+str(msgnum)+", '"+time+"', "+str(msg['UserID'])+", ST_GeomFromText('POINT("+str(float(msg['longitude']))+" "+str(float(msg['latitude']))+")'), '"+text+"' WHERE NOT EXISTS (SELECT * FROM table1 WHERE timestamp='"+time+"' AND text='"+text+"');")
                    cu.execute("INSERT INTO table2 ( field1, field2, field3, time_stamp, pos ) SELECT "+str(msg['UserID'])+", "+str(int(msg['UserName']))+", "+str(int(msg['UserIO']))+", '"+time+"', ST_GeomFromText('POINT("+str(float(msg['longitude']))+" "+str(float(msg['latitude']))+")') WHERE NOT EXISTS (SELECT * FROM table2 WHERE field1="+str(msg['UserID'])+");")
                    cu.execute("UPDATE table2 SET field3='"+str(int(msg['UserIO']))+"', time_stamp='"+str(time)+"', pos=ST_GeomFromText('POINT("+str(float(msg['longitude']))+" "+str(float(msg['latitude']))+")') WHERE field1='"+str(msg['UserID'])+"' AND time_stamp < '"+str(time)+"';")
                elif msg['type']==2:
                ....
                ....
                ....
        except Exception, err:
            #print('ERROR: %s\n' % str(err))
            logging.error('ERROR: %s\n' % str(err))
            cx.commit()

        cx.commit()

1 Answer:

Answer 0 (score: 1):

Doing multiple rows per transaction, and per query, will make this faster.

When I faced a similar problem, I put multiple rows in the VALUES part of the insert query; but you have complicated insert queries, so you'll likely need a different approach.

I'd suggest creating a temporary table and inserting, say, 10,000 rows into it with plain multi-row inserts:

insert into temptable values ( /* row1 data */ ) ,( /* row2 data */ ) etc...

500 rows per insert is a good starting point.
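
Since you mention copy_from: if your rows are easy to serialise as plain text, filling the temp table by streaming it through COPY is usually even faster than multi-row inserts. A minimal sketch, assuming a hypothetical temptable and placeholder column names and line format:

    import io
    import psycopg2

    cx = psycopg2.connect("dbname='postgis20' user='postgres' password='' host='localhost'")
    cu = cx.cursor()

    # Buffer rows as tab-separated lines, then stream them in one COPY.
    buf = io.StringIO()
    buf.write(u"1\t2014-12-30 05:47:04\t42\thello\n")  # one line per row
    buf.seek(0)
    cu.copy_from(buf, 'temptable', sep='\t',
                 columns=('messageid', 'time_stamp', 'userid', 'text'))
    cx.commit()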

Then join the temp table against the existing data to de-duplicate it:

delete from temptable using livetable where /* join condition */ ;

and de-dupe it against itself too, if that's needed:

delete from temptable where id not in 
  ( select distinct on ( /* unique columns */) id from temptable);

Then use an insert-select to copy the rows from the temp table into the live table:

insert into livetable ( /* columns */ )
  select /* columns */ from temptable; 

It looks like you might need an update-from too.
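
For instance, something roughly like this (the column names are guesses based on your table2):

    update livetable
       set field3 = temptable.field3,
           time_stamp = temptable.time_stamp,
           pos = temptable.pos
      from temptable
     where livetable.field1 = temptable.field1
       and livetable.time_stamp < temptable.time_stamp;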

Finally, drop the temp table and start over.

Since you're writing to two tables, you'll need to double up all of these operations.

I'd do the insert by maintaining a count and a list of values to insert, and then at insert time build the query by repeating the (%s,%s,%s,%s) part as many times as needed, passing the list of values in separately and letting psycopg2 handle the formatting.
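
A minimal sketch of that pattern, assuming a hypothetical four-column temptable and comma-separated log lines:

    import psycopg2

    BATCH_SIZE = 500  # rows per multi-row insert

    cx = psycopg2.connect("dbname='postgis20' user='postgres' password='' host='localhost'")
    cu = cx.cursor()

    values = []   # flat list of parameters, 4 per row
    count = 0     # rows buffered so far

    def flush():
        global values, count
        if count == 0:
            return
        # Repeat the placeholder group once per buffered row and let
        # psycopg2 do all the quoting and escaping of the values.
        placeholders = ",".join(["(%s,%s,%s,%s)"] * count)
        cu.execute("INSERT INTO temptable VALUES " + placeholders, values)
        cx.commit()
        values = []
        count = 0

    for line in open('textfile.log'):
        fields = line.rstrip('\n').split(',')
        if len(fields) < 4:
            continue  # skip malformed lines
        values.extend(fields[:4])
        count += 1
        if count >= BATCH_SIZE:
            flush()

    flush()  # insert any final partial batch

The important part is that the SQL string contains only placeholders; every value travels in the parameter list, so psycopg2 handles the quoting and you also avoid the fragile string concatenation in your Insert function.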

I'd expect these changes could get you a speedup of 5× or more.