Question

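I am trying to read fixed-width files and bulk load the data into various tables on a Postgres server, one file per table.

I have already written a function that creates the table for me. I just need to populate it.

After reading the documentation and doing some other research, I found two useful commands: "copy_from" in psycopg2 and ".to_sql" in pandas. I managed to build a working implementation with the latter, and the resulting table looks great. The only problem is that uploading a file of 110,000 rows and 100 columns takes about 14 minutes.

The former is apparently much faster. I just cannot get it to work.

Here is the code I have so far: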

import pandas as pd
import psycopg2 as pg

def sql_gen(cur,conn,filename,widths,col_names,table,header=False,chunk_size=10**5):

    df = pd.read_fwf(filename,widths=widths,index_col=False,header=None,iterator=True,
                     chunksize=chunk_size,names=col_names)

    for chunk in df:
        cur.copy_from(df,table,null="")
        yield conn.commit()

#parameters
data_path = 'data.dat'
table = 'example_table'

#some stuff to extract stuff we need
widths = getwidths
cols = getcols

#main part of script
conn = None
try:
    conn = pg.connect('connectionstring')
    cursor = conn.cursor()

    for sql in sql_gen(cursor,conn,data_path,widths,cols,table):
        print(sql)

    # close communication with the PostgreSQL database server
    cursor.close()
except (Exception, pg.Error) as error :
    print(error)
finally:
    #closing database conn.
    if conn is not None:
        conn.close()
        print("PostgreSQL conn is closed")

I expected this to work, but instead I get a TypeError:

argument 1 must have both .read() and .readline() methods

Full traceback, as requested:


  File "<ipython-input-10-542d72b61dd4>", line 4, in <module>
    for sql in sql_gen(cursor,conn,data_path,widths,cols,table):

  File "<ipython-input-8-f82fb5831db3>", line 7, in sql_generator
    cur.copy_from(df,'example_table',null="")
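
Judging from the error message, "copy_from" wants a file-like object (something with .read() and .readline()), not the DataFrame iterator that the code passes in. One possible direction is to write each chunk into an in-memory text buffer and hand that buffer to "copy_from". The following is only a minimal sketch under assumptions: the helper name copy_chunk is hypothetical, the table's column order is assumed to match the DataFrame's, and the data is assumed to contain no tabs, newlines, double quotes, or backslashes (which would need escaping for the COPY text format).

```python
import io


def copy_chunk(cur, chunk, table, columns=None):
    """Serialize one DataFrame chunk to tab-separated text and COPY it.

    `cur` is a psycopg2 cursor and `chunk` is a pandas DataFrame
    (hypothetical helper; not part of either library).
    """
    buf = io.StringIO()
    # Write plain tab-delimited rows: no header line, no index column.
    # Missing values are written as empty strings by default.
    chunk.to_csv(buf, sep="\t", header=False, index=False)
    buf.seek(0)  # rewind so copy_from reads from the beginning
    # null="" tells COPY to treat empty fields as NULL.
    cur.copy_from(buf, table, sep="\t", null="", columns=columns)
```

Inside sql_gen, the loop body would then call copy_chunk(cur, chunk, table) for each chunk instead of passing df directly.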

Populating a Postgres table from a fixed-width file using psycopg2 and pandas

0 answers: