'For' loop to read multiple CSV files from a Google storage bucket into 1 pandas DataFrame

Asked: 2019-09-18 00:30:36

Tags: python-3.x pandas google-cloud-storage dask

I currently have 31 .csv files (all with the same structure: 60 columns wide and roughly 5,000 rows deep) that I am trying to read from a Google storage bucket into a single pandas DataFrame using a 'for' loop, and I keep getting a 'timeout' error after 6 minutes.

After some testing I noticed that I can read one .csv file at a time, but as soon as I introduce 2 or more files the timeout error appears. This leads me to think the problem is my code rather than the size of the data.

My code is below (should I be using pd.concat at some stage of the for loop?). Any help would be much appreciated.

def stage1eposdata(data, context):  

    from google.cloud import storage
    from google.cloud import bigquery
    import pandas as pd
    import dask.dataframe as dd
    import io
    import numpy as np
    import datetime as dt
    from googleapiclient import discovery
    from pandas.io.json import json_normalize
    import google.auth
    import math

    destination_path1 = 'gs://staged_data/ddf-*_stet.csv'  

    ## Source Buckets #
    raw_epos_bucket = 'raw_data'
    cleaned_epos_bucket = 'staged_data'

    # Confirming Oauth #
    storage_client = storage.Client()
    bigquery_client = bigquery.Client()

    # Confirming Connection #
    raw_epos_data = storage_client.bucket(raw_epos_bucket)
    cleaned_epos_data = storage_client.bucket(cleaned_epos_bucket)

    df = pd.DataFrame()

    for file in list(raw_epos_data.list_blobs(prefix='2019/')):
        file_path = "gs://{}/{}".format(file.bucket.name, file.name)
        df = df.append(pd.read_csv(file_path), sort=False)

    ddf = dd.from_pandas(df, npartitions=1, sort=True)
    ddf.to_csv(destination_path1, index=True, sep=',')

2 answers:

Answer 0 (score: 2):

Try this:

    ## Source Buckets #
    raw_epos_bucket = 'raw_data'
    cleaned_epos_bucket = 'staged_data'

    # Confirming Oauth #
    storage_client = storage.Client()
    bigquery_client = bigquery.Client()

    # Confirming Connection #
    raw_epos_data = storage_client.bucket(raw_epos_bucket)
    cleaned_epos_data = storage_client.bucket(cleaned_epos_bucket)


    my_dataframe_list = []

    # Read each blob into its own DataFrame and collect them in a list
    for file in list(raw_epos_data.list_blobs(prefix='2019/')):
        file_path = "gs://{}/{}".format(file.bucket.name, file.name)
        my_dataframe_list.append(pd.read_csv(file_path))

    # Concatenate once, outside the loop
    df = pd.concat(my_dataframe_list)
    ddf = dd.from_pandas(df, npartitions=1, sort=True)
    ddf.to_csv(destination_path1, index=True, sep=',')

pd.concat joins a list of DataFrames. So on each iteration of the loop you simply keep the DataFrame in the list my_dataframe_list, and you concatenate the whole list once, outside the loop. As long as the columns match, it should work. Collecting the frames and concatenating once also avoids the repeated copying that df.append does inside a loop.
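As a small illustration of that column-matching behaviour (toy frames, not the actual bucket data): pd.concat aligns the frames on their column names, so as long as every file has the same 60 columns the result is simply all the rows stacked in order:

import pandas as pd

a = pd.DataFrame({'store': [1, 2], 'sales': [10.0, 12.5]})
b = pd.DataFrame({'store': [3], 'sales': [9.9]})

# One concatenation over the whole list, outside any loop
combined = pd.concat([a, b], ignore_index=True, sort=False)
print(combined)
#    store  sales
# 0      1   10.0
# 1      2   12.5
# 2      3    9.9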

Answer 1 (score: 0):

It turns out that dask, with its 'lazy' evaluation, handles this kind of thing very well. My solution was as follows:

## Source Buckets #
raw_epos_bucket = 'raw_data'
cleaned_epos_bucket = 'staged_data'

# Confirming Oauth #
storage_client = storage.Client()
bigquery_client = bigquery.Client()

# Confirming Connection #
raw_epos_data = storage_client.bucket(raw_epos_bucket)
cleaned_epos_data = storage_client.bucket(cleaned_epos_bucket)

# '*' is a wildcard, so no more 'for' loops are needed
ddf = dd.read_csv('gs://raw_data/*.csv')

# Optionally collapse to a single partition so one output file is written
ddf = ddf.repartition(npartitions=1)
ddf.to_csv(destination_path1, index=True, sep=',')
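If what is ultimately needed is a single in-memory pandas DataFrame rather than staged CSV output, the lazy dask frame can be materialised with .compute(). A minimal sketch, assuming the same gs://raw_data bucket as above and that gcsfs is installed so gs:// paths can be opened:

import dask.dataframe as dd

ddf = dd.read_csv('gs://raw_data/*.csv')  # lazy: builds a task graph, reads nothing yet
df = ddf.compute()                        # triggers the parallel read and returns a pandas DataFrame
print(df.shape)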