How can I insert JSON response data into a Snowflake database more efficiently?

Time: 2020-01-24 17:36:23

Tags: python json python-3.x etl snowflake-cloud-data-platform

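I am currently looping through a JSON response and inserting each row one at a time.

This is very slow, even when inserting only a few thousand rows.

What is the most efficient way to insert the data?

Here is my code.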

from module import usr, pwd, acct, db, schem, api_key
import requests
import snowflake.connector
import datetime

end_point = 'users'

def snowflake_connect():
    global cursor, mydb
    mydb = snowflake.connector.connect(
        user=usr,
        password=pwd,
        account=acct,
        database=db,
        schema=schem,
    )

def snowflake_insert(id, activated, name):
    global cursor
    snowflake_connect()
    cursor = mydb.cursor()
    sql_insert_query = """ INSERT INTO USERS(ID, ACTIVATED, NAME) VALUES (%s, %s, %s)"""
    insert_tuple = (id, activated, name)
    cursor.execute(sql_insert_query, insert_tuple)
    return cursor

def get_users():
    url = 'https://company.pipedrive.com/v1/{}?&api_token={}'.format(end_point,api_key)
    response = requests.request("GET", url).json()
    read_users(response)

def read_users(response):   
    for data in response['data']:
        id = data['id']
        activated = data['activated']
        name = data['name']     
        snowflake_insert(id, activated, name)

if __name__ == "__main__":  
    snowflake_truncate()
    get_users()
cursor.close()

1 Answer:

Answer 0 (score: 2)

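As others have pointed out in the comments, for the highest efficiency (especially for continuous loads), the best practice is to load formatted data files directly into Snowflake rather than using INSERT statements.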
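A minimal sketch of that file-based approach, assuming the `USERS` table from the question and a cursor obtained from an already open `snowflake.connector` connection. The helper names (`write_rows_to_csv`, `load_csv_into_table`, `bulk_load_users`) and the file format options are illustrative assumptions, not part of the question: the rows are written to a local CSV file, uploaded to the table's stage with `PUT`, and loaded with `COPY INTO`.

```python
import csv
import os
import tempfile


def write_rows_to_csv(rows, path):
    """Write an iterable of (id, activated, name) tuples to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    return path


def load_csv_into_table(cursor, path, table="USERS"):
    """Upload a local CSV file to the table stage, then bulk-load it.

    `cursor` is assumed to come from an open snowflake.connector connection.
    `table` is interpolated into the SQL text, so it must be a trusted name.
    """
    # Build a file:// URI with forward slashes for the PUT command.
    file_uri = "file://" + os.path.abspath(path).replace("\\", "/")
    # Stage the file in the table's own stage (@%TABLE).
    cursor.execute("PUT {} @%{}".format(file_uri, table))
    # Load every staged file into the table and remove it afterwards.
    cursor.execute(
        "COPY INTO {0} FROM @%{0} "
        "FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '\"') "
        "PURGE = TRUE".format(table)
    )


def bulk_load_users(cursor, response):
    """Turn the API response into rows and load them via a staged file."""
    rows = [(d["id"], d["activated"], d["name"]) for d in response["data"]]
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = write_rows_to_csv(rows, os.path.join(tmp_dir, "users.csv"))
        load_csv_into_table(cursor, path)
    return len(rows)
```

With this shape, `read_users(response)` in the question could call `bulk_load_users(cursor, response)` once instead of issuing one INSERT per row. Whether the extra file step pays off depends on the data volume; for a few thousand rows the batched INSERT discussed next may already be sufficient.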

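However, the code in the question can also be improved to minimize the overhead incurred for each inserted row. The main observations are visible in the code itself: snowflake_insert calls snowflake_connect() for every row, so a new connection and cursor are created per row, and each row is sent as its own INSERT statement. Reusing a single connection and passing all rows to executemany in one call avoids both costs.

A revised version of the code: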

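from module import usr, pwd, acct, db, schem, api_key
import requests
import snowflake.connector
import datetime

# The '?' placeholders below need qmark binding; the connector defaults to pyformat (%s).
snowflake.connector.paramstyle = 'qmark'

end_point = 'users'
MYDB = None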

def snowflake_connect():
    # Reuse one connection for the whole run instead of reconnecting per row.
    global MYDB
    if MYDB is None:
        MYDB = snowflake.connector.connect(
            user=usr,
            password=pwd,
            account=acct,
            database=db,
            schema=schem,
        )

def snowflake_insert_all(rows):
    snowflake_connect()
    cursor = MYDB.cursor()
    sql_insert_query = "INSERT INTO USERS(ID, ACTIVATED, NAME) VALUES (?, ?, ?)"
    cursor.executemany(sql_insert_query, rows)
    return cursor

def get_users():
    url = 'https://company.pipedrive.com/v1/{}?&api_token={}'.format(end_point,api_key)
    response = requests.request("GET", url).json()
    read_users(response)

def read_users(response):
    # Collect every row first so they can be inserted with a single executemany call.
    all_data = [(data['id'], data['activated'], data['name']) for data in response['data']]
    snowflake_insert_all(all_data)

if __name__ == "__main__":
    # snowflake_truncate() is called as in the question; its definition is not shown there.
    snowflake_truncate()
    get_users()
    if MYDB is not None:
        MYDB.close()

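Note: Here I have focused only on improving the Snowflake and DB-API interaction, but in general there are other problems with the way the script is written (variable and method naming, unnecessary use of globals, resource handling, etc.). If you want to improve the program further, Code Review can help.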
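As one illustration of the globals and resource-handling points, here is a sketch that passes the connection explicitly and closes the cursor and connection even if the insert fails. The `connect` argument is a hypothetical factory standing in for a call such as `snowflake.connector.connect(...)` with the credentials from the question, and the function names are assumptions made for this example. The `%s` placeholders match the connector's default `pyformat` binding style.

```python
from contextlib import closing

INSERT_SQL = "INSERT INTO USERS(ID, ACTIVATED, NAME) VALUES (%s, %s, %s)"


def extract_rows(response):
    """Pull (id, activated, name) tuples out of the API response."""
    return [(d["id"], d["activated"], d["name"]) for d in response["data"]]


def insert_users(connection, rows):
    """Insert all rows with one executemany call; the cursor is always closed."""
    if not rows:
        return 0  # nothing to insert, so skip the round trip entirely
    with closing(connection.cursor()) as cursor:
        cursor.executemany(INSERT_SQL, rows)
    return len(rows)


def load_users(connect, response):
    """Open a connection via the `connect` factory, insert, and always close it."""
    with closing(connect()) as connection:
        return insert_users(connection, extract_rows(response))
```

A call might look like `load_users(lambda: snowflake.connector.connect(user=usr, password=pwd, account=acct, database=db, schema=schem), response)`. Passing the factory in keeps the database code free of module-level state and makes it possible to exercise the logic with a stand-in connection.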