Best way to bulk insert a pandas DataFrame into a PostgreSQL table

Time: 2018-02-12 18:03:23

Tags: postgresql python-3.6 psycopg2

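I need to upload several Excel files into a PostgreSQL table, but some of their records may overlap, so I need to watch for IntegrityErrors. I am following two approaches:

cursor.copy_from: the fastest approach, but I don't know how to catch and handle all the IntegrityErrors caused by duplicate records.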

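from io import StringIO

# Serialize the DataFrame to an in-memory CSV buffer (no index, no header)
streamCSV = StringIO()
invoicing_info.to_csv(streamCSV, index=False, header=False, sep=';')
streamCSV.seek(0)

with conn.cursor() as c:
    # Take the column list from the same DataFrame that supplies the data
    c.copy_from(streamCSV, "staging.table_name", columns=list(invoicing_info.columns), sep=';')
conn.commit()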

cursor.execute: I can count and handle each exception, but it is very slow.

data = invoicing_info.to_dict(orient='records')

# Counters must exist before the loop increments them
successful_inserts = 0
duplicate_registers = 0

with cursor as c:
    for entry in data:
        try:
            c.execute(DLL_INSERT, entry)
            connection.commit()
            # Count the row only once the commit has succeeded
            successful_inserts += 1
            print('Successful insert. Operation number {}'.format(successful_inserts))
        except psycopg2.IntegrityError:
            # Discard the failed transaction so the next insert can run
            connection.rollback()
            duplicate_registers += 1
            print('Duplicate entry. Operation number {}'.format(duplicate_registers))

At the end of the routine, I need to report the following information:

print("Initial shape: {}".format(invoicing_info.shape))
print("Successful inserts: {}".format(successful_inserts))
print("Duplicate entries: {}".format(duplicate_registers))

How can I modify the first approach so that it handles all the exceptions? How can I optimize the second approach?

2 Answers:

Answer 0: (score: 3)

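Since you have duplicate IDs across different Excel sheets, you first have to decide for yourself which sheet's data you trust.

If you work with several tables and are happy to keep at least one row from each set of conflicting rows, you can always do the following: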

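  • Create a temporary table for each Excel sheet
  • Upload each Excel sheet's data into its own table (in bulk, as you do now)
  • Insert from a SELECT DISTINCT ON (id) over those tables, like this: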
INSERT INTO staging.table_name(id, col1, col2 ...)
SELECT DISTINCT ON(id) 
     id, col1, col2
FROM 
(
    SELECT id, col1, col2 ... 
       FROM staging.temp_table_for_excel_sheet1
    UNION
    SELECT id, col1, col2 ... 
       FROM staging.temp_table_for_excel_sheet2
    UNION
    SELECT id, col1, col2 ... 
       FROM staging.temp_table_for_excel_sheet3
) as data

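With an insert like this, PostgreSQL keeps an arbitrary row from each set of rows that share an ID.

If you want to trust the first record, you can add an ordering: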

INSERT INTO staging.table_name(id, col1, col2 ...)
SELECT DISTINCT ON(id) 
     id, col1, col2
FROM 
(
    SELECT id, 1 as ordering_column, col1, col2 ... 
       FROM staging.temp_table_for_excel_sheet1
    UNION
    SELECT id, 2 as ordering_column, col1, col2 ... 
       FROM staging.temp_table_for_excel_sheet2
    UNION
    SELECT id, 3 as ordering_column, col1, col2 ... 
       FROM staging.temp_table_for_excel_sheet3
) as data
-- DISTINCT ON(id) requires id to be the leading ORDER BY expression
ORDER BY id, ordering_column
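
The ordered query can also be generated for any number of sheets. The following is a minimal sketch, not part of the original answer: `build_merge_sql`, the `tmp_sheet_N` table names and the `id` key column are hypothetical, and all identifiers are assumed to come from trusted code because they are interpolated directly into the SQL string. The ORDER BY starts with the key column because PostgreSQL requires the DISTINCT ON expressions to match the leftmost ORDER BY expressions.

```python
def build_merge_sql(target, temp_tables, columns, key="id"):
    """Build an INSERT ... SELECT DISTINCT ON statement that keeps, for each
    key value, the row from the earliest table in temp_tables.

    All names are interpolated as-is, so they must come from trusted code.
    """
    if key not in columns:
        raise ValueError("key column must be one of the columns")
    col_list = ", ".join(columns)
    # One SELECT per temp table, tagged with its priority (1 = most trusted).
    selects = [
        "SELECT {cols}, {n} AS ordering_column FROM {table}".format(
            cols=col_list, n=n, table=table)
        for n, table in enumerate(temp_tables, start=1)
    ]
    # UNION ALL is enough here: DISTINCT ON removes the duplicates afterwards.
    union = "\n    UNION ALL\n    ".join(selects)
    return (
        "INSERT INTO {target} ({cols})\n"
        "SELECT DISTINCT ON ({key}) {cols}\n"
        "FROM (\n    {union}\n) AS data\n"
        "ORDER BY {key}, ordering_column"
    ).format(target=target, cols=col_list, key=key, union=union)


sql = build_merge_sql(
    "staging.table_name",
    ["tmp_sheet_1", "tmp_sheet_2"],
    ["id", "col1", "col2"],
)
print(sql)
```

The order of `temp_tables` encodes which sheet is trusted first, so changing the priority only means reordering that list.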

Initial record count:

SELECT sum(count)
FROM 
( 
  -- UNION ALL keeps equal counts; plain UNION would collapse them into one row
  SELECT count(*) as count FROM temp_table_for_excel_sheet1
  UNION ALL
  SELECT count(*) as count FROM temp_table_for_excel_sheet2
  UNION ALL
  SELECT count(*) as count FROM temp_table_for_excel_sheet3
) as data

Once this bulk insert has finished, you can run select count(*) FROM staging.table_name to get the total number of inserted records.

You can count the duplicates by running:

SELECT sum(count)
FROM 
(
  -- Note: an id in sheet3 that is also in sheet1 and sheet2 is counted twice
  SELECT count(*) as count 
  FROM temp_table_for_excel_sheet2 WHERE id in (select id FROM temp_table_for_excel_sheet1)
  UNION ALL
  SELECT count(*) as count 
  FROM temp_table_for_excel_sheet3 WHERE id in (select id FROM temp_table_for_excel_sheet1)
  UNION ALL
  SELECT count(*) as count 
  FROM temp_table_for_excel_sheet3 WHERE id in (select id FROM temp_table_for_excel_sheet2)
) as data
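
The steps of this answer can be tied together from Python roughly as follows. This is only a sketch under several assumptions: `conn` is an open psycopg2 connection that is not in autocommit mode, every sheet is already a pandas DataFrame whose columns exist in the target table, and the final INSERT ... SELECT statement is passed in as `merge_sql`. The function name `load_and_merge` and the `tmp_sheet_N` table names are hypothetical. It uses `cursor.copy_expert` with CSV format instead of the question's `copy_from`, so that fields quoted by `to_csv` are parsed correctly. The duplicate count is simply rows loaded minus rows inserted, which assumes the target table does not already hold any of these ids.

```python
from io import StringIO


def load_and_merge(conn, frames, target, merge_sql):
    """Load each DataFrame into its own temp table, run merge_sql once,
    and return the counts the question asks for."""
    initial_rows = 0
    # "with conn" commits on success and rolls back on error (psycopg2).
    with conn:
        with conn.cursor() as cur:
            for n, frame in enumerate(frames, start=1):
                temp = "tmp_sheet_{}".format(n)  # hypothetical naming scheme
                # LIKE copies column definitions but not the primary key or
                # unique indexes, so duplicate ids load without errors.
                cur.execute(
                    "CREATE TEMP TABLE {} (LIKE {}) ON COMMIT DROP".format(
                        temp, target))
                buf = StringIO()
                frame.to_csv(buf, index=False, header=False, sep=";")
                buf.seek(0)
                cur.copy_expert(
                    "COPY {} ({}) FROM STDIN WITH (FORMAT csv, DELIMITER ';')"
                    .format(temp, ", ".join(frame.columns)),
                    buf)
                initial_rows += len(frame)
            cur.execute(merge_sql)
            inserted = cur.rowcount  # rows written by the INSERT
    return {
        "initial_rows": initial_rows,
        "successful_inserts": inserted,
        "duplicate_entries": initial_rows - inserted,
    }
```

Because everything runs in one transaction, a failure in any step leaves the target table unchanged, and the temp tables disappear at commit.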

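Answer 1: (score: 0)

If the Excel sheets contain duplicate records, pandas seems like a viable option for identifying and eliminating the duplicates: https://33sticks.com/python-for-business-identifying-duplicate-data/. Or is the problem that different records in different sheets share the same ID/index? If so, a similar approach might work: use pandas to isolate the IDs that are used more than once, then fix them with unique identifiers before attempting the upload to the SQL database.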
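
As a rough illustration of the pandas route, the sketch below flags rows whose ID already appeared in an earlier sheet. The two small DataFrames and the `id` and `col1` column names are made up for the example; real data would come from the Excel files in the question.

```python
import pandas as pd

# Hypothetical stand-ins for two Excel sheets already read with pandas.
sheet1 = pd.DataFrame({"id": [1, 2], "col1": ["a", "b"]})
sheet2 = pd.DataFrame({"id": [2, 3], "col1": ["c", "d"]})

# Stack the sheets in order of trust: earlier sheets win on conflicts.
combined = pd.concat([sheet1, sheet2], ignore_index=True)

# True for every row whose id already appeared in an earlier row.
is_duplicate = combined.duplicated(subset=["id"], keep="first")

unique_rows = combined[~is_duplicate]
duplicate_rows = combined[is_duplicate]

print("Initial shape: {}".format(combined.shape))
print("Rows to insert: {}".format(len(unique_rows)))
print("Duplicate entries: {}".format(len(duplicate_rows)))
```

`unique_rows` could then go through a single bulk copy, provided the target table does not already contain any of those IDs; `duplicate_rows` is what would need new identifiers or manual review.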

For bulk uploads, I would use an ORM. SQLAlchemy has some good information on bulk uploads: http://docs.sqlalchemy.org/en/rel_1_0/orm/persistence_techniques.html#bulk-operations, and there is a related discussion here: Bulk insert with SQLAlchemy ORM