Question

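I may be spoiled by big data technologies, but when I insert about 300k rows totaling 30 MB (in CSV format), I don't think 15 minutes is an acceptable time for an INSERT with Postgres.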

I first read that increasing the total disk size on GCP also increases IOPS, so I grew my disk from 20 GB to 400 GB.

Then, looking around the internet, I found this article: https://naysan.ca/2020/05/09/pandas-to-postgresql-using-psycopg2-bulk-insert-performance-benchmark/

I was using plain old df.to_sql() to insert my data (crazy, right!). I remember there is a parameter like fast_executemany that works with the MSSQL driver, but not with Postgres.

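So I tried the faster method from the article: copying the df into an in-memory buffer and writing that buffer to the database. With all these upgrades (disk size + code optimization), I went from 30-35 minutes down to about 15.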

An improvement! But that is still about 1000 seconds.
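A minimal sketch of the in-memory COPY approach described above, assuming an open psycopg2 connection is passed in as `conn`. The helper names `rows_to_csv_buffer` and `copy_rows` are hypothetical. It swaps in `copy_expert` with `FORMAT csv` instead of `copy_from`, because `copy_from` uses the text format, where a quoted value containing a comma would be split into extra columns.

```python
import csv
from io import StringIO


def rows_to_csv_buffer(rows):
    """Serialize an iterable of row tuples into an in-memory CSV buffer."""
    buffer = StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # None is written as an empty field, which COPY's csv format
    # reads as NULL by default.
    writer.writerows(rows)
    buffer.seek(0)
    return buffer


def copy_rows(conn, rows, table):
    """Stream rows into `table` with COPY ... FROM STDIN (FORMAT csv).

    Assumes `conn` is an open psycopg2 connection and `table` is a trusted,
    possibly schema-qualified name whose columns match the row layout.
    """
    buffer = rows_to_csv_buffer(rows)
    copy_sql = f"COPY {table} FROM STDIN WITH (FORMAT csv)"
    with conn.cursor() as cursor:
        cursor.copy_expert(copy_sql, buffer)
    conn.commit()
```

With a DataFrame, the rows could come from `df.itertuples(index=True, name=None)`; missing values would need to be converted to `None` beforehand.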

What could be the problem here? My internet connection is very good, so the bottleneck is not there.

Here is a sample of what I am doing with psycopg2 to load the table I am targeting:

import pandas as pd
import psycopg2
import time
from io import StringIO


if __name__ == '__main__':
    startTime = time.time()
    df = pd.read_gbq(
        "SELECT * FROM dataset.defaut",
        project_id="my_id_12345",
        location="europe-west1"
    )

    executionTime = (time.time() - startTime)
    print('Execution GBQ Read time in seconds: ' + str(executionTime))

    df_qgis = df[
        ["voie", "direction", "date_mesure",
         "dfo_id", "pk_ref", "type_defaut",
         "niveau_ref", "val_ref", "longueur",
         "pk_debut", "pk_fin", "pk", "pk_original"]]

    param_dic = {
        "host": "1.2.3.4",
        "database": "qgis",
        "user": "user",
        "password": "y0lo123"
    }


    def connect(params_dic):
        """ Connect to the PostgreSQL database server """
        conn = None
        try:
            # connect to the PostgreSQL server
            print('Connecting to the PostgreSQL database...')
            conn = psycopg2.connect(**params_dic)
        except (Exception, psycopg2.DatabaseError) as error:
            print(error)
            exit(1)
        print("Connection successful")
        return conn


    conn = connect(param_dic)


    def copy_from_stringio(conn, df, table):
        """
        Write the dataframe to an in-memory CSV buffer
        and use copy_from() to copy it to the table
        """
        # save dataframe to an in memory buffer
        buffer = StringIO()
        df.to_csv(buffer, index_label='id', header=False)
        buffer.seek(0)
        cursor = conn.cursor()
        try:
            cursor.copy_from(buffer, table, sep=",")
            conn.commit()
        except (Exception, psycopg2.DatabaseError) as error:
            print("Error: %s" % error)
            conn.rollback()
            cursor.close()
            return 1
        print("copy_from_stringio() done")
        cursor.close()


    copy_from_stringio(conn, df_qgis, "develop.defaut")

    executionTime = (time.time() - startTime)
    print('Execution final time in seconds: ' + str(executionTime))
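Note that the script above measures the final time from the same `startTime`, so that figure also includes the BigQuery read. A minimal sketch for timing each phase separately, so the COPY step can be judged on its own; the helper names `timed` and `rows_per_second` are hypothetical, and the two timed blocks are stand-ins for the real read and copy calls.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, timings):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


def rows_per_second(row_count, seconds):
    """Return throughput in rows per second (0.0 if no time was measured)."""
    return row_count / seconds if seconds > 0 else 0.0


if __name__ == "__main__":
    timings = {}
    with timed("read", timings):
        rows = [(i, "x") for i in range(1000)]  # stand-in for pd.read_gbq(...)
    with timed("copy", timings):
        time.sleep(0.01)  # stand-in for copy_from_stringio(conn, df_qgis, ...)
    for label, seconds in timings.items():
        print(f"{label}: {seconds:.3f} s")
    rate = rows_per_second(len(rows), timings["copy"])
    print(f"copy throughput: {rate:.0f} rows/s")
```

For scale, the figures in the question (about 300k rows in about 1000 seconds) work out to roughly 300 rows per second.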

Answer 1

According to the Cloud SQL best practices, to speed up imports (for small instances):

You can temporarily increase the tier of the instance to improve performance when importing large data sets.

Example machine types for PostgreSQL instances are listed in the Cloud SQL documentation.

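If that does not help, I recommend contacting GCP technical support so they can inspect the instance with their internal tools and check whether it is running out of resources.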

Cloud SQL - Postgres - Very slow insert
