Question

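I am trying to pull a 1.7G file into a pandas dataframe from a Greenplum postgres data source. The psycopg2 driver takes about 8 minutes to load it. Using the pandas "chunksize" parameter does not help, because the psycopg2 driver selects all the data into memory and then hands it off to pandas, using more than 2G of RAM.

To get around this, I'm trying to use a named cursor, but all the examples I've found loop through the result row by row, and that just seems slow. The main problem, though, appears to be that my SQL stops working in the named query for some unknown reason.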

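Goals

Load the data as quickly as possible without doing any "unnatural acts"
Use SQLAlchemy if possible - used for consistency
Have the results in a pandas dataframe for fast in-memory processing (alternatives?)

Have a "pythonic" (elegant) solution. I'd love to do this with a context manager but haven't gotten that far yet.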

# Named Cursor Chunky Access Test
import pandas as pd
import psycopg2
import psycopg2.extras

# Connect to database - works
conn_chunky = psycopg2.connect(
    database=database, user=username, password=password, host=hostname)
# Open named cursor - appears to work
cursor_chunky = conn_chunky.cursor(
    'buffered_fetch', cursor_factory=psycopg2.extras.DictCursor)
cursor_chunky.itersize = 100000

# This is where the problem occurs - the SQL works just fine in all other tests, returns 3.5M records
result = cursor_chunky.execute(sql_query)
# result returns None (normal behavior) but result is not iterable

df = pd.DataFrame(result.fetchall())

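The pandas call fails with AttributeError: 'NoneType' object has no attribute 'fetchall'. The failure seems to be due to the named cursor being used. I have tried fetchone, fetchmany, etc. Note that the goal here is to let the server chunk the data and serve it up in large chunks, so that there is a balance between bandwidth and CPU usage. Looping through df = df.append(row) is just plain ugly.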
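The AttributeError above comes from calling fetchall() on the return value of execute(), which is None in psycopg2; rows have to be fetched from the cursor object itself. Below is a minimal sketch of pulling large chunks from a named (server-side) cursor with fetchmany() and building the frame once at the end. The helper names iter_chunks and load_dataframe are hypothetical, and conn and sql_query are assumed to be the connection and query from the snippet above.

```python
from typing import Any, Iterator, List, Sequence


def iter_chunks(cursor: Any, chunk_rows: int = 100_000) -> Iterator[List[Sequence[Any]]]:
    """Yield lists of rows from a DB-API cursor that has already run execute()."""
    while True:
        # Fetch from the cursor, not from the return value of execute().
        rows = cursor.fetchmany(chunk_rows)
        if not rows:
            break
        yield rows


def load_dataframe(conn: Any, sql_query: str, chunk_rows: int = 100_000):
    """Read a query through a psycopg2 named cursor into one DataFrame (sketch)."""
    import pandas as pd  # imported here so iter_chunks has no third-party dependency

    frames = []
    columns = None
    # Passing a name asks psycopg2 for a server-side cursor.
    with conn.cursor(name="buffered_fetch") as cursor:
        cursor.execute(sql_query)  # returns None
        for rows in iter_chunks(cursor, chunk_rows):
            if columns is None:
                # With named cursors, description may not be populated until
                # after the first fetch, so it is read here.
                columns = [col[0] for col in cursor.description]
            frames.append(pd.DataFrame(rows, columns=columns))
    if not frames:
        return pd.DataFrame()
    # One concat at the end avoids re-copying the frame on every chunk.
    return pd.concat(frames, ignore_index=True)
```

With the question's objects this would be called as df = load_dataframe(conn_chunky, sql_query). The chunk size trades round trips against client memory and would need tuning for the actual table.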

See related questions (not the same issue):

Added standard client-side chunking code per request

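nrows = 3652504
size = nrows // 1000  # chunksize must be an integer
idx = 0
first_loop = True
for dfx in pd.read_sql(iso_cmdb_base, engine, coerce_float=False, chunksize=size):
    if first_loop:
        df = dfx
        first_loop = False
    else:
        # NOTE: DataFrame.append was removed in pandas 2.0;
        # pd.concat([df, dfx], ignore_index=True) is the equivalent there.
        df = df.append(dfx, ignore_index=True)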

Answer 1

UPDATE:

# Chunked access
import time

import pandas as pd
from sqlalchemy import create_engine

start = time.time()
engine = create_engine(conn_str)  # conn_str: the database URL
size = 10**4
df = pd.concat(pd.read_sql(iso_cmdb_base, engine, coerce_float=False, chunksize=size),
               ignore_index=True)
elapsed = time.time() - start
print('time:', elapsed / 60, 'minutes or', elapsed, 'seconds')
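The question observes that chunksize alone does not reduce memory use, because psycopg2 buffers the full result on the client before pandas sees it. One possible way around that while staying with SQLAlchemy is to ask the connection to stream results. What follows is only a sketch, assuming SQLAlchemy's stream_results execution option (which, with psycopg2, requests a server-side cursor); read_sql_streaming is a hypothetical helper name, and engine is built as in the snippet above.

```python
import pandas as pd


def read_sql_streaming(engine, query: str, chunksize: int = 10**4) -> pd.DataFrame:
    """Read a query in chunks over a connection that requests streamed results."""
    # stream_results=True asks the dialect not to pre-buffer the whole result set.
    with engine.connect().execution_options(stream_results=True) as conn:
        chunks = pd.read_sql(query, conn, coerce_float=False, chunksize=chunksize)
        # Consume the chunk generator while the connection is still open.
        return pd.concat(chunks, ignore_index=True)
```

It would be called as df = read_sql_streaming(engine, iso_cmdb_base). Whether this actually lowers peak memory or load time on Greenplum is not something the thread confirms, so it would need to be measured.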

OLD answer:

I'd try reading your data from PostgreSQL using the internal pandas method read_sql():

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user@localhost:5432/dbname')

df = pd.read_sql(sql_query, engine)

Accessing large datasets with Python 3.6, psycopg2 and pandas