I have been using Python to fetch data from a Postgres database, and it consumes an enormous amount of memory, as shown below:
The function below is the only one I am running, and it uses far too much memory. I am using fetchmany() to retrieve the data in small chunks, and I have also tried iterating over the cursor directly. However, all of these approaches result in excessive memory usage. Does anyone know why this happens? Is there anything on the Postgres side that I could tune to help mitigate this problem?
import logging

import pandas as pd
import psycopg2
from tqdm import tqdm

def checkMultipleLine(dbName):
    '''
    Checks for rows that contain data spanning multiple lines.
    This is the most basic of checks. If a particular row has
    data that spans multiple lines, then that particular row
    is corrupt. For dealing with these rows we must first find
    out whether there are places in the database that contain
    data that spans multiple lines.
    '''
    logger = logging.getLogger('mindLinc.checkSchema.checkMultipleLines')
    logger.info('Finding rows that span multiple lines')
    schema = findTables(dbName)  # helper defined elsewhere; maps table names to columns
    results = []
    for t in tqdm(sorted(schema.keys())):
        conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'"%dbName)
        cur = conn.cursor()
        cur.execute('select * from %s'%t)
        n = 0  # rows containing embedded newlines
        N = 0  # total rows scanned
        while True:
            css = cur.fetchmany(1000)
            if not css: break
            for cs in css:
                N += 1
                if any(['\n' in c for c in cs if type(c)==str]):
                    n += 1
        cur.close()
        conn.close()
        tqdm.write('[%40s] -> [%5d][%10d][%.4e]'%(t, n, N, n/(N+1.0)))
        results.append({
            'tableName': t,
            'totalRows': N,
            'badRows' : n,
        })
    logger.info('Finished checking for multiple lines')
    results = pd.DataFrame(results)[['tableName', 'badRows', 'totalRows']]
    print(results)
    results.to_csv('error_MultipleLine[%s].csv'%(dbName), index=False)
    return results
Answer 0 (score: 2)
Psycopg2 supports server-side cursors for large queries, as described in this answer. By default psycopg2 uses a client-side cursor, which means the entire result set of `select * from table` is transferred into client memory when execute() returns; fetchmany() only controls how you walk through rows that have already been fetched. A named (server-side) cursor keeps the result set on the server and streams it in batches. Here is how to use one together with a client-side buffer setting:
cur = conn.cursor('cursor-name')   # a named cursor is kept server-side
cur.itersize = 10000               # records fetched per network round trip
This should reduce the memory footprint.
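For reference, here is a minimal sketch of how the per-table scan from the question could be rewritten around a named cursor. The cursor name 'row-scan', the function name countMultilineRows, and the itersize value are illustrative choices, not anything prescribed by psycopg2:

import psycopg2

def countMultilineRows(dbName, table):
    '''Count rows in `table` whose string columns contain embedded newlines,
    streaming the result set through a server-side (named) cursor.'''
    conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'" % dbName)
    cur = conn.cursor('row-scan')   # named cursor -> result set stays on the server
    cur.itersize = 10000            # rows pulled per network round trip (illustrative)
    cur.execute('select * from %s' % table)
    n = N = 0
    for row in cur:                 # iteration fetches batches of itersize rows
        N += 1
        if any('\n' in c for c in row if isinstance(c, str)):
            n += 1
    cur.close()
    conn.close()
    return n, N

Note that a named cursor runs inside a transaction and can only be executed once, so create a fresh cursor for each query rather than reusing one across tables.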