Efficiently writing a Pandas DataFrame to Google BigQuery

Time: 2018-02-20 13:47:17

Tags: python pandas google-bigquery google-cloud-storage google-cloud-python

I'm trying to upload a pandas.DataFrame to Google BigQuery using the pandas.DataFrame.to_gbq() function documented here. The problem is that to_gbq() takes 2.3 minutes, while uploading the file directly to Google Cloud Storage through the GUI takes less than a minute. I plan to upload a number of dataframes (~32), each of a similar size, so I want to know which is the faster alternative.

This is the script I'm using:

dataframe.to_gbq('my_dataset.my_table', 
                 'my_project_id',
                 chunksize=None, # I've tried several chunk sizes; it runs faster as one big chunk (at least for me)
                 if_exists='append',
                 verbose=False
                 )

dataframe.to_csv(str(month) + '_file.csv') # the file size is 37.3 MB; this takes almost 2 seconds
# manually upload the file into GCS GUI
print(dataframe.shape)
(363364, 21)

My question is: which alternative is faster?

  1. Uploading the Dataframe with the pandas.DataFrame.to_gbq() function
  2. Saving the Dataframe as a CSV and then uploading it as a file to BigQuery using the Python API
  3. Saving the Dataframe as a CSV, then uploading the file to Google Cloud Storage using this procedure, and then reading it from BigQuery (a hedged sketch of this approach follows this list)
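
For reference, here is a minimal sketch of alternative 3 using the google-cloud-storage and google-cloud-bigquery clients; the bucket, dataset, and table names are placeholders, and schema autodetection is assumed to be acceptable:

    from google.cloud import bigquery, storage

    def load_csv_via_gcs(csv_path, bucket_name, dataset_id, table_id):
        # Upload the local CSV file to Google Cloud Storage
        storage_client = storage.Client()
        blob = storage_client.bucket(bucket_name).blob(csv_path)
        blob.upload_from_filename(csv_path)

        # Load the GCS object into BigQuery with a load job
        bigquery_client = bigquery.Client()
        table_ref = bigquery_client.dataset(dataset_id).table(table_id)
        job_config = bigquery.LoadJobConfig()
        job_config.source_format = bigquery.SourceFormat.CSV
        job_config.autodetect = True
        job_config.skip_leading_rows = 1  # skip the CSV header row
        job = bigquery_client.load_table_from_uri(
            'gs://{}/{}'.format(bucket_name, csv_path), table_ref,
            job_config=job_config)
        job.result()  # wait for the load job to finish
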
Update:

Alternative 2, using pd.DataFrame.to_csv() and load_data_from_file(), seems to take longer than alternative 1 (17.9 seconds more on average over 3 runs):

    from google.cloud import bigquery

    def load_data_from_file(dataset_id, table_id, source_file_name):
        bigquery_client = bigquery.Client()
        dataset_ref = bigquery_client.dataset(dataset_id)
        table_ref = dataset_ref.table(table_id)
    
        with open(source_file_name, 'rb') as source_file:
            # This example uses CSV, but you can use other formats.
            # See https://cloud.google.com/bigquery/loading-data
            job_config = bigquery.LoadJobConfig()
            # note: current google-cloud-bigquery versions document bigquery.SourceFormat.CSV here
            job_config.source_format = 'text/csv'
            job_config.autodetect=True
            job = bigquery_client.load_table_from_file(
                source_file, table_ref, job_config=job_config)
    
        job.result()  # Waits for job to complete
    
        print('Loaded {} rows into {}:{}.'.format(
            job.output_rows, dataset_id, table_id))
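
A minimal sketch of how the three-run average could have been measured; the dataset, table, and file names below are placeholders, not taken from the post:

    import time

    timings = []
    for _ in range(3):
        start = time.time()
        load_data_from_file('my_dataset', 'my_table', 'my_file.csv')  # placeholder names
        timings.append(time.time() - start)
    print('average load time: {:.1f}s'.format(sum(timings) / len(timings)))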
    
Thank you!

1 Answer:

Answer 0 (score: 5)

I compared alternatives 1 and 3 in Datalab using the following code:

from datalab.context import Context
import datalab.storage as storage
import datalab.bigquery as bq
import pandas as pd
from pandas import DataFrame
import time

# DataFrame to write: ~100,000 identical rows of (1, 2, 3)
my_data = [[1, 2, 3]]
for i in range(0, 100000):
    my_data.append([1, 2, 3])
not_so_simple_dataframe = pd.DataFrame(data=my_data, columns=['a', 'b', 'c'])

#Alternative 1
start = time.time()
not_so_simple_dataframe.to_gbq('TestDataSet.TestTable', 
                 Context.default().project_id,
                 chunksize=10000, 
                 if_exists='append',
                 verbose=False
                 )
end = time.time()
print("time alternative 1 " + str(end - start))

#Alternative 3
start = time.time()
sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'

# Define the storage bucket and create it if it does not already exist
sample_bucket = storage.Bucket(sample_bucket_name)
if not sample_bucket.exists():
    sample_bucket.create()

# Define the BigQuery dataset and table, creating the dataset if needed
dataset = bq.Dataset(bigquery_dataset_name)
table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name)
if not dataset.exists():
    dataset.create()

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_dataframe(not_so_simple_dataframe)
table.create(schema=table_schema, overwrite=True)

# Write the DataFrame to GCS (Google Cloud Storage)
%storage write --variable not_so_simple_dataframe --object $sample_bucket_object

# Write the DataFrame to a BigQuery table
table.insert_data(not_so_simple_dataframe)
end = time.time()
print("time alternative 3 " + str(end - start))

Here are the results for n = {10000, 100000, 1000000}:

n       alternative_1  alternative_3
10000   30.72s         8.14s
100000  162.43s        70.64s
1000000 1473.57s       688.59s

Judging by these results, alternative 3 is faster than alternative 1.