Question

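I have a data analysis script that I am putting together. The script connects to Teradata, runs Select * on a table, and loads the result into a pandas dataframe.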

import teradata
import pandas as pd

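# UdaExec has to be created before connecting; appName and version are placeholders
udaExec = teradata.UdaExec(appName="xxx", version="1.0", logConsole=False)

with udaExec.connect(method="xxx", dsn="xxx", username="xxx", password="xxx") as session: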

    query = "Select * from TableA"

    # read in records
    df = pd.read_sql(query, session)

    # misc pandas tests below...

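This works great for tables with 100k records or fewer, but the problem is that many tables have far more records than that (millions and millions), and the script just runs indefinitely.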

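Is there an intermediate step I can take? I've been researching, and one thing I've seen is copying the DB table to a .csv or .txt file first and then loading the pandas dataframe from that file (instead of loading from the table itself), but I can't make sense of it.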

Any advice would be appreciated! Thanks.

Answer 1

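In a comment I promised to provide some code that quickly reads a table from a server into a local CSV file and then reads that CSV file into a Pandas dataframe. Note that this code was written for postgresql, but you could probably adapt it to other databases fairly easily.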

Here is the code:

from io import StringIO  # on Python 2: from cStringIO import StringIO
import psycopg2
import psycopg2.sql as sql
import pandas as pd

database = 'my_db'
pg_host = 'my_postgres_server'
table = 'my_table'
# note: you should also create a ~/.pgpass file with the credentials needed to access
# this server, e.g., a line like "*:*:*:username:password" (if you only access one server)

con = psycopg2.connect(database=database, host=pg_host)
cur = con.cursor()    

# Copy data from the database to a dataframe, using psycopg2 .copy_expert() function.
csv = StringIO()  # or tempfile.SpooledTemporaryFile()
# The next line is the right way to insert a table name into a query, but it requires 
# psycopg2 >= 2.7. See here for more details: https://stackoverflow.com/q/13793399/3830997
copy_query = sql.SQL("COPY {} TO STDOUT WITH CSV HEADER").format(sql.Identifier(table))
cur.copy_expert(copy_query, csv)
csv.seek(0)  # move back to start of csv data
df = pd.read_csv(csv)
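
For databases without a COPY command, the same "dump to CSV, then load" idea can be approximated with plain DB-API calls. The following is a minimal sketch, not a tested Teradata recipe: it assumes the driver exposes a standard DB-API 2.0 cursor (`execute`, `description`, `fetchmany`), and it uses an in-memory SQLite database as a stand-in so that it runs anywhere. The helper name `dump_query_to_csv`, the `batch_size` value, and the sample table contents are made up for illustration.

```python
import csv
import os
import sqlite3
import tempfile

import pandas as pd


def dump_query_to_csv(conn, query, csv_path, batch_size=50_000):
    """Stream a query result to a CSV file in batches and return the row count."""
    cur = conn.cursor()
    cur.execute(query)
    header = [col[0] for col in cur.description]  # column names, per DB-API 2.0
    total = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        while True:
            rows = cur.fetchmany(batch_size)  # only one batch is held in memory
            if not rows:
                break
            writer.writerows(rows)
            total += len(rows)
    cur.close()
    return total


# Stand-in database (assumption: replace with your own connection object)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableA (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO TableA VALUES (?, ?)",
    [(i, f"row{i}") for i in range(1000)],
)

csv_path = os.path.join(tempfile.mkdtemp(), "TableA.csv")
n_rows = dump_query_to_csv(conn, "SELECT * FROM TableA", csv_path, batch_size=100)

# Load the intermediate file; read_csv also accepts chunksize for very large files
df = pd.read_csv(csv_path)
```

Whether this beats `pd.read_sql()` depends on the driver and the network, so it is worth timing both on a mid-sized table before committing to the extra step.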

Here is also some code that writes a large dataframe to the database via the CSV route:

csv = StringIO()
df.to_csv(csv, index=False, header=False)
csv.seek(0)
try:
    cur.copy_from(csv, table, sep=',', null='\\N', size=8192, columns=list(df.columns))
    con.commit()
except:
    con.rollback()
    raise

I tested this code on my 10 Mbps office network (don't ask!) with a 70,000-row table (5.3 MB as CSV).

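When reading a table from the database, I found that the code above was about 1/3 faster than pandas.read_sql() (5.5s vs. 8s). I'm not sure that would justify the extra complexity in most cases. This is probably about as fast as you can get: postgresql's COPY TO ... command is very fast, and so is Pandas' read_csv.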

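When writing a dataframe to the database, I found that using a CSV file (the code above) was about 50x faster than using pandas' df.to_sql() (5.8s vs. 288s). This is mainly because Pandas doesn't use multi-row inserts. That seems to have been a subject of active discussion for years; see https://github.com/pandas-dev/pandas/issues/8953.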
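
For reference, more recent pandas releases (0.24 and later, as far as I know) accept a `method="multi"` argument to `to_sql()`, which packs several rows into each INSERT statement. The sketch below only shows the call shape; it uses an in-memory SQLite database as a stand-in, the table and column names are invented, and the timings quoted above were not measured with this option.

```python
import sqlite3

import pandas as pd

# Sample data (assumption: stands in for the real dataframe)
df = pd.DataFrame({"id": range(1000), "value": [i * 0.5 for i in range(1000)]})

conn = sqlite3.connect(":memory:")

# method="multi" sends many rows per INSERT statement.
# chunksize caps the rows per statement: 100 rows x 2 columns = 200 bound
# parameters, which stays under the parameter limits some backends enforce.
df.to_sql(
    "my_table",
    conn,
    index=False,
    if_exists="replace",
    method="multi",
    chunksize=100,
)

row_count = conn.execute("SELECT COUNT(*) FROM my_table").fetchone()[0]
```

How much this helps varies by backend, and a bulk-load path such as COPY may still be faster for large tables.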

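A few notes on chunksize: it may not do what most users expect. If you set chunksize in pandas.read_sql(), the query still runs as one command, but the results are returned to your program in batches; this is done with an iterator that yields each chunk in turn. If you use chunksize in df.to_sql(), the inserts are done in batches, which reduces memory requirements. However, at least on my system, each batch is still broken down into an individual insert statement per row, and those take a long time to run.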
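
To make the iterator behaviour concrete, here is a small sketch that reads a table in chunks and keeps only running totals, so the full result never has to sit in one dataframe. It assumes an in-memory SQLite table as a stand-in for the real one; the `amount` column and the chunk size of 250 are invented for the example.

```python
import sqlite3

import pandas as pd

# Stand-in table (assumption: replace with your own connection and query)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableA (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO TableA VALUES (?, ?)",
    [(i, float(i)) for i in range(1000)],
)

# With chunksize set, read_sql returns an iterator of dataframes
chunks = pd.read_sql("SELECT * FROM TableA", conn, chunksize=250)

total_rows = 0
total_amount = 0.0
chunk_sizes = []
for chunk in chunks:
    # Each chunk is an ordinary dataframe with at most 250 rows
    chunk_sizes.append(len(chunk))
    total_rows += len(chunk)
    total_amount += chunk["amount"].sum()
```

If the whole table is needed at once, `pd.concat(chunks, ignore_index=True)` rebuilds a single dataframe, at which point the memory saving is gone.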

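Also note: the odo package looks like it would be great for moving data quickly between dataframes and any database. I couldn't get it to run successfully, but you may have better luck. More info: http://odo.pydata.org/en/latest/overview.html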

Answer 2

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html

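The .read_sql() method appears to have a chunksize parameter. Have you tried something like df = pd.read_sql(query, session, chunksize=100000)? (I used a chunk size of 100k because you said 100k records weren't a problem.) Note that with chunksize set, the call returns an iterator of dataframes rather than a single dataframe, so you loop over it to process each chunk.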

Reading a large table into pandas: is there an intermediate step?
