Question

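I have a table with about 900,000 rows in a PostgreSQL database. I want to copy it row by row into another table that has some extra columns, transforming each row and filling in the new columns along the way. The problem is that RAM fills up.

Here is the relevant part of the code: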

engine = sqlalchemy.create_engine(URL(**REMOTE), echo=False)
Session = sessionmaker(bind=engine)
session = Session()
n = 1000
counter = 1
for i in range(1, total + 1, n):
    # Build the comma-separated id list for this batch
    ids = ",".join(str(j) for j in range(i, i + n))
    q = "SELECT * from table_parts where id in (%s)" % ids
    r = session.execute(q).fetchall()
    for element in r:
        data = {}
        # ....
        # [taking data from each row, extracting string, calculation,
        #  and filling extra columns that the new table has]
        # ...
        query = query.bindparams(**data)
        try:
            session.execute(query)
        except:
            session.rollback()
            raise
        if counter % n == 0:
            print("COMMITTING.... %s %s" % (
                counter, datetime.datetime.now().strftime("%H:%M:%S")))
            session.commit()
        counter += 1

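The queries are correct, so there are no errors. Until I press Ctrl+C, the new table is updated correctly.

The problem seems to be the query "SELECT * from table_parts where id in (1,2,3,4...1000)". I have also tried PostgreSQL arrays.

Things I have already tried: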

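- Adding execution_options(stream_results=True), from here: results = (connection.execution_options(stream_results=True).execute(query)). As far as I know, this uses a server-side cursor when used with PostgreSQL. For this I dropped the session object from the posted code and used engine.connect().
- Creating a new connection object on every iteration. Surprisingly, this did not work either; RAM still fills up.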

From the documentation:

Note that the stream_results execution option is enabled automatically if the yield_per() method is used.

So yield_per in the Query API amounts to the same thing as the stream_results option mentioned above.

Thanks
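For comparison with the fetchall() call in the posted code, here is a minimal sketch of consuming a result in fixed-size chunks through the DB-API fetchmany() method. It uses the standard-library sqlite3 module purely as a stand-in so that it runs without a server; that substitution, the table name demo_parts, the helper iter_rows, and the chunk size are all assumptions for illustration. With PostgreSQL through psycopg2, an ordinary cursor still transfers the whole result to the client, which is why a server-side cursor (what stream_results asks for) matters there.

```python
import sqlite3


def iter_rows(cursor, chunk_size=100):
    """Yield rows one at a time, fetching at most chunk_size rows per call."""
    while True:
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:
            break
        for row in chunk:
            yield row


# Hypothetical in-memory table standing in for table_parts (assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_parts (id INTEGER PRIMARY KEY, data TEXT)")
conn.executemany(
    "INSERT INTO demo_parts (data) VALUES (?)",
    [("A" * 10,) for _ in range(1000)],
)
conn.commit()

cursor = conn.execute("SELECT id, data FROM demo_parts ORDER BY id")
processed = 0
for row_id, data in iter_rows(cursor):
    # Transform the row here; the Python side holds one chunk at a time
    processed += 1
```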

Answer 1

create table table_parts ( id serial primary key, data text );
-- Insert 1M rows of about 32kB data =~ 32GB of data
-- Needs only 0.4GB of disk space because of builtin compression
-- Might take a few minutes
insert into table_parts(data)
  select rpad('',32*1024,'A') from generate_series(1,1000000);

The following code, which uses SQLAlchemy Core, does not use much memory:

import sqlalchemy
import datetime
import getpass

metadata = sqlalchemy.MetaData()
table_parts = sqlalchemy.Table('table_parts', metadata,
    sqlalchemy.Column('id', sqlalchemy.Integer, primary_key=True),
    sqlalchemy.Column('data', sqlalchemy.String)
)

engine = sqlalchemy.create_engine(
    'postgresql:///'+getpass.getuser(),
    echo=False
)
connection = engine.connect()

n = 1000

select_table_parts_n = sqlalchemy.sql.select([table_parts]).\
    where(table_parts.c.id>sqlalchemy.bindparam('last_id')).\
    order_by(table_parts.c.id).\
    limit(n)

update_table_parts = table_parts.update().\
    where(table_parts.c.id == sqlalchemy.bindparam('table_part_id')).\
    values(data=sqlalchemy.bindparam('table_part_data'))

last_id=0
while True:
    with connection.begin() as transaction:
        row = None
        for row in connection.execute(select_table_parts_n, last_id=last_id):
            data = row.data.replace('A','B')
            connection.execute(
                update_table_parts,
                table_part_id=row.id,
                table_part_data=data
            )
        if row is None:
            # The last SELECT returned no rows, so everything is processed
            break
        else:
            print("COMMITTING {} {:%H:%M:%S}".format(
                row.id, datetime.datetime.now()))
            transaction.commit()
            last_id = row.id

You don't seem to be using any ORM features, so I think you might as well use SQLAlchemy Core.
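The loop in this answer pages through the table by remembering the last id it has seen (WHERE id > last_id ORDER BY id LIMIT n) instead of building an id list, a pattern often called keyset pagination. Below is a minimal sketch of the same pattern with the standard-library sqlite3 module rather than SQLAlchemy and PostgreSQL, chosen only so that it runs without a server. The table demo_parts, the function process_in_batches, and the batch size are hypothetical names and values, not part of the answer's code.

```python
import sqlite3


def process_in_batches(conn, batch_size=100):
    """Rewrite every row's data in batches, committing after each batch.

    Returns the number of batches committed.
    """
    last_id = 0
    batches = 0
    while True:
        # Fetch the next batch after the last id already processed
        rows = conn.execute(
            "SELECT id, data FROM demo_parts WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()  # at most batch_size rows are held in memory
        if not rows:
            break
        for row_id, data in rows:
            conn.execute(
                "UPDATE demo_parts SET data = ? WHERE id = ?",
                (data.replace("A", "B"), row_id),
            )
        conn.commit()
        last_id = rows[-1][0]
        batches += 1
    return batches


# Hypothetical in-memory table standing in for table_parts (assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_parts (id INTEGER PRIMARY KEY, data TEXT)")
conn.executemany(
    "INSERT INTO demo_parts (data) VALUES (?)",
    [("A" * 8,) for _ in range(250)],
)
conn.commit()

batches = process_in_batches(conn, batch_size=100)
remaining = conn.execute(
    "SELECT COUNT(*) FROM demo_parts WHERE data LIKE '%A%'"
).fetchone()[0]
```

Because each SELECT is bounded by LIMIT and starts from an indexed id, memory use per iteration depends on the batch size rather than on the table size.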
