Excessive memory usage when fetching data from a Postgres database

Asked: 2017-03-10 03:22:39

Tags: python postgresql python-2.7 psycopg2

I have been using Python to fetch data from a Postgres database, and it is consuming a large amount of memory, as shown below:

[image: memory usage graph]

The function below is the only one I am running, and it uses an excessive amount of memory. I am using fetchmany() to retrieve the data in small chunks. I also tried iterating over the cursor cur directly. However, all of these approaches result in excessive memory usage. Does anyone know why this happens? Is there anything that needs tuning on the Postgres side that could help mitigate this problem?

import logging

import pandas as pd
import psycopg2
from tqdm import tqdm

def checkMultipleLine(dbName):
    '''
    Checks for rows that contain data spanning multiple lines.

    This is the most basic of checks. If a particular row has
    data that spans multiple lines, then that particular row
    is corrupt. To deal with these rows, we must first find
    out whether there are places in the database that contain
    data spanning multiple lines.
    '''

    logger = logging.getLogger('mindLinc.checkSchema.checkMultipleLines')
    logger.info('Finding rows that span multiple lines')

    schema = findTables(dbName)  # defined elsewhere; its keys are the table names

    results = []
    for t in tqdm(sorted(schema.keys())):

        conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'"%dbName)
        cur  = conn.cursor()
        cur.execute('select * from %s'%t)
        n = 0  # rows that contain an embedded newline
        N = 0  # total rows scanned
        while True:
            css = cur.fetchmany(1000)  # read rows in chunks of 1000
            if not css: break
            for cs in css:
                N += 1
                if any('\n' in c for c in cs if isinstance(c, str)):
                    n += 1
        cur.close()
        conn.close()

        tqdm.write('[%40s] -> [%5d][%10d][%.4e]'%(t, n, N, n/(N+1.0)))
        results.append({
            'tableName': t,
            'totalRows': N,
            'badRows'  : n,
        })


    logger.info('Finished checking for multiple lines')

    results = pd.DataFrame(results)[['tableName', 'badRows', 'totalRows']]
    print results
    results.to_csv('error_MultipleLine[%s].csv'%(dbName), index=False)

    return results

1 Answer:

Answer 0 (score: 2)

Psycopg2 supports server-side cursors for large queries, as described in this answer. With a regular (client-side) cursor, psycopg2 transfers the entire result set to the client as soon as the query executes; fetchmany() only steps through rows that are already held in local memory, which is why reading in chunks does not help. Here is how to use a server-side cursor together with a client-side buffer setting:

cur = conn.cursor('cursor-name')  # passing a name creates a server-side cursor
cur.itersize = 10000              # records fetched to the client per round trip

This should reduce the memory footprint.
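
For completeness, a minimal sketch of the scan loop from the question rewritten around a named cursor. It assumes the same dbName, table name t, and counters as in the question's code; the cursor name row_scanner is arbitrary:

conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'"%dbName)
cur  = conn.cursor('row_scanner')  # named cursor, DECLAREd on the server
cur.itersize = 10000               # rows pulled from the server per round trip
cur.execute('select * from %s'%t)

n = 0
N = 0
for cs in cur:  # iterating streams rows instead of loading the whole result set
    N += 1
    if any('\n' in c for c in cs if isinstance(c, str)):
        n += 1

cur.close()
conn.close()

Note that a named cursor can execute only a single query and lives inside a transaction, so close it before opening a new one for the next table.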