Loading a .txt file from Google Cloud Storage into a Pandas DF

Date: 2019-07-15 08:06:23

Tags: pandas google-cloud-functions google-cloud-storage

I'm trying to load a .txt file from a GCS bucket into a pandas df via pd.read_csv. When I run this code on my local machine (pulling the .txt file from a local directory), it works fine. However, when I run the code in a Cloud Function, accessing the same .txt file from the GCS bucket, I get "TypeError: cannot use a string pattern on a bytes-like object".

The only difference is that I'm accessing the .txt file through the GCS bucket, so it's a bucket object (blob) rather than a normal file. Do I need to download the blob as a string or a file-like object first, before running pd.read_csv? Code is below.

def stage1_cogs_vfc(data, context):  

    from google.cloud import storage
    import pandas as pd
    import dask.dataframe as dd
    import io
    import numpy as np


    start_bucket = 'my_bucket'   
    storage_client = storage.Client()
    source_bucket = storage_client.bucket(start_bucket)

    df = pd.DataFrame()

    file_path = 'gs://my_bucket/SCE_Var_Fact_Costs.txt'
    df = pd.read_csv(file_path,skiprows=12, encoding ='utf-8', error_bad_lines= False, warn_bad_lines= False , header = None ,sep = '\s+|\^+',engine='python')

Traceback (most recent call last):

      File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 383, in run_background_function
        _function_handler.invoke_user_function(event_object)
      File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 217, in invoke_user_function
        return call_user_function(request_or_event)
      File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 214, in call_user_function
        event_context.Context(**request_or_event.context))
      File "/user_code/main.py", line 20, in stage1_cogs_vfc
        df = pd.read_csv(file_path,skiprows=12, encoding ='utf-8', error_bad_lines= False, warn_bad_lines= False , header = None ,sep = '\s+|\^+',engine='python')
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 702, in parser_f
        return _read(filepath_or_buffer, kwds)
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 429, in _read
        parser = TextFileReader(filepath_or_buffer, **kwds)
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 895, in __init__
        self._make_engine(self.engine)
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1132, in _make_engine
        self._engine = klass(self.f, **self.options)
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2238, in __init__
        self.unnamed_cols) = self._infer_columns()
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2614, in _infer_columns
        line = self._buffered_line()
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2689, in _buffered_line
        return self._next_line()
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2791, in _next_line
        next(self.data)
      File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2379, in _read
        yield pat.split(line.strip())
    TypeError: cannot use a string pattern on a bytes-like object

2 answers:

Answer 0: (score: 1)

I found a similar case here.

I also noticed this:

source_bucket = storage_client.bucket(source_bucket)

You're using "source_bucket" both as the variable name and as the argument. I'd suggest changing one of them.

However, for any other questions related to the API itself, I'd point you to this documentation: Storage Client - Google Cloud Storage API
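For the bytes-vs-string error itself, one common approach is to download the blob's contents and decode them to text before handing them to pandas. Here's a minimal sketch; the GCS calls are left as comments (they need credentials to run) and replaced with stand-in bytes, and the bucket/file names simply mirror the question:

```python
import io
import pandas as pd
# from google.cloud import storage  # needed inside the Cloud Function

# In the Cloud Function, you would fetch the blob's raw bytes, e.g.:
# client = storage.Client()
# blob = client.bucket('my_bucket').get_blob('SCE_Var_Fact_Costs.txt')
# content = blob.download_as_string()  # returns bytes
content = b"a 1^2\nb 3^4\n"  # stand-in for the downloaded bytes

# Decode the bytes and wrap them in a file-like object, so the python
# engine's regex separator operates on text rather than bytes.
df = pd.read_csv(
    io.StringIO(content.decode('utf-8')),
    header=None,
    sep=r'\s+|\^+',
    engine='python',
)
```

This sidesteps the TypeError because the parser now receives str lines, not bytes.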

Answer 1: (score: 0)

Building on @K_immer's pointer, here is my updated code, which includes reading into a "Dask" df...

def stage1_cogs_vfc(data, context):  

    from google.cloud import storage
    import pandas as pd
    import dask.dataframe as dd
    import io
    import numpy as np
    import datetime as dt


    start_bucket = 'my_bucket'
    destination_path = 'gs://my_bucket/ddf-*_cogs_vfc.csv'

    storage_client = storage.Client()
    bucket = storage_client.get_bucket(start_bucket)

    blob = bucket.get_blob('SCE_Var_Fact_Costs.txt')

    df0 = pd.DataFrame()

    file_path = 'gs://my_bucket/SCE_Var_Fact_Costs.txt'
    df0 = dd.read_csv(file_path, skiprows=12, dtype=object, encoding='utf-8', error_bad_lines=False, warn_bad_lines=False, header=None, sep=r'\s+|\^+', engine='python')


    df7 = df0.compute() # converts the Dask df to a pandas df

    # then do your heavy ETL stuff here using pandas...