Read all .csv files from a Google Storage bucket into one large pandas df, then save as .csv to another bucket

Time: 2019-06-30 07:12:18

Tags: python-3.x pandas dataframe google-cloud-functions google-cloud-storage

In my Google Cloud Function (Python 3.7 runtime), I have created a function that attempts to download all the .csv files from a Google Storage bucket into a pandas DataFrame (df). Once in the DataFrame, I am going to do some simple ETL work on it and then convert it back into one large .csv file to save to another bucket.

The problem I run into is that when I get to the point of reading the objects (converted to strings with file.download_as_string()) into read_csv(), I get an error related to io.StringIO: TypeError: initial_value must be str or None, not bytes.

In read_csv(io.StringIO(file_contents)....), is this related to where the io.StringIO method is placed? Can anyone help me correct this error?

    def stage1slemonthly(data, context, source_bucket='my_source_bucket',
                         destination_bucket='gs://my_destination_bucket'):


        from google.cloud import storage
        import pandas as pd
        import pyspark
        from pyspark.sql import SQLContext
        import io

        storage_client = storage.Client()

        # source_bucket = data['bucket']
        # source_file = data['name']
        source_bucket = storage_client.bucket(source_bucket)

        # load in the col names
        col_names = ["Customer_Country_Number", "Customer_Name", "Exclude",
             "SAP_Product_Name", "CP_Sku_Code", "Exclude", "UPC_Unit",
             "UPC_Case", "Colgate_Month_Year", "Total_Cases",
             "Promoted_Cases", "Non_Promoted_Cases",
             "Planned_Non_Promoted_Cases", "Exclude",
             "Lead_Measure", "Tons", "Pieces", "Liters",
             "Tons_Conv_Revenue", "Volume_POS_Units", "Scan_Volume",
             "WWhdrl_Volume", "Exclude", "Exclude", "Exclude", "Exclude",
             "Exclude", "Exclude", "Exclude", "Exclude", "Investment_Buy",
             "Exclude", "Exclude", "Gross_Sales", "Claim_Sales",
             "Adjusted_Gross_Sales", "Exclude", "Exclude",
             "Consumer_Investment", "Consumer_Allowance",
             "Special_Pack_FG", "Coupons", "Contests_Offers", 
             "Consumer_Price_Reduction", "Permanent_Price_Reduction",
             "Temporary_Price_Reduction", "TPR_Off_Invoice", "TPR_Scan",
             "TPR_WWdrwl_Exfact", "Every_Day_Low_Price", "Closeouts",
             "Inventory_Price_Reduction", "Exclude", "Customer_Investment",
             "Prompt_Payment", "Efficiency_Drivers", "Efficient_Logistics",
             "Efficient_Management", "Business_Builders_Direct", "Assortment",
             "Customer_Promotions","Customer_Promotions_Terms",
             "Customer_Promotions_Fixed", "Growth_Direct",
             "New_Product_Incentives", "Free_Goods_Direct",
             "Shopper_Marketing", "Business_Builders_Indirect",
             "Middleman_Performance", "Middleman_Infrastructure",
             "Growth_Indirect", "Indirect_Retailer_Investments",
             "Free_Goods_Indirect", "Other_Customer_Investments",
             "Product_Listing_Allowances", "Non_Performance_Trade_Payments",
             "Exclude", "Variable_Rebate_Adjustment", 
             "Overlapping_OI_Adjustment", "Fixed_Accruals",
             "Variable_Accruals", "Total_Accruals", "Gross_To_Net",
             "Invoiced_Sales", "Exclude", "Exclude", "Net_Sales",
             "Exclude", "Exclude", "Exclude", "Exclude", "Exclude",
             "Exclude", "Exclude", "Exclude", "Exclude",
             "Total_Variable_Cost", "Margin", "Exclude"]

        df = pd.DataFrame(columns=[col_names])

        for file in list(source_bucket.list_blobs()):
          file_contents = file.download_as_string() 
          df = df.append(pd.read_csv(io.StringIO(file_contents), header=None, names=[col_names]))

        df = df.reset_index(drop=True)

        # do ETL work here in future

        sc = pyspark.SparkContext.getOrCreate()
        sqlCtx = SQLContext(sc)
        sparkDf = sqlCtx.createDataFrame(df)
        sparkDf.coalesce(1).write.option("header", "true").csv(destination_bucket)

When I run it, I get the following error message...

    Traceback (most recent call last):
      File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 383, in run_background_function
        _function_handler.invoke_user_function(event_object)
      File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 217, in invoke_user_function
        return call_user_function(request_or_event)
      File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 214, in call_user_function
        event_context.Context(**request_or_event.context))
      File "/user_code/main.py", line 56, in stage1slemonthly
        df = df.append(pd.read_csv(io.StringIO(file_contents), header=None, names=[col_names]))
    TypeError: initial_value must be str or None, not bytes

1 answer:

Answer 0 (score: 5):

You get this error message because file.download_as_string() has return type bytes, not str, and you cannot pass a bytes value to io.StringIO (initial_value=file_contents).
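
For illustration, here is a minimal sketch of two ways around the bytes/str mismatch (blob stands for any google.cloud.storage blob, such as the loop variable in your code, and UTF-8 content is assumed):

    import io
    import pandas as pd

    file_contents = blob.download_as_string()  # returns bytes, not str

    # Option 1: decode the bytes into a str before wrapping it in StringIO
    df = pd.read_csv(io.StringIO(file_contents.decode("utf-8")), header=None)

    # Option 2: keep the bytes and hand read_csv an io.BytesIO instead
    df = pd.read_csv(io.BytesIO(file_contents), header=None)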

Moreover, col_names is defined as a list here, so writing pd.DataFrame(columns=[col_names]) and pd.read_csv(..., names=[col_names]) is incorrect: you should use col_names instead of [col_names].
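
A quick way to see what the extra brackets do (purely illustrative):

    import pandas as pd

    col_names = ["a", "b", "c"]

    # A flat list produces a regular Index, as intended
    print(pd.DataFrame(columns=col_names).columns)
    # Index(['a', 'b', 'c'], dtype='object')

    # A nested list is treated as a specification of MultiIndex levels instead
    print(pd.DataFrame(columns=[col_names]).columns)
    # MultiIndex([('a',), ('b',), ('c',)], )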

Anyway, this does not seem to be the right way to read CSV files from Google Cloud Storage. You would rather write:

    from google.cloud import storage
    import pandas as pd
    import io

    storage_client = storage.Client()

    source_bucket = storage_client.bucket(source_bucket)

    col_names = ["Customer_Country_Number", "Customer_Name", ...]

    df = pd.DataFrame(columns=col_names)

    for file in list(source_bucket.list_blobs()):
        file_path = "gs://{}/{}".format(file.bucket.name, file.name)
        df = df.append(pd.read_csv(file_path, header=None, names=col_names))

    # the rest of your code

Indeed, you can use the read_csv method of pandas to read files directly from GCS instead of downloading the file to load it, but you need to have the gcsfs package installed.
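
Below is a minimal sketch of that direct-read approach (the bucket and file names are placeholders; it assumes gcsfs has been installed, e.g. with pip install gcsfs):

    import pandas as pd

    # With gcsfs installed, pandas resolves gs:// URLs by itself
    df = pd.read_csv("gs://my_source_bucket/some_file.csv", header=None)

    # Recent pandas versions (1.0+, via fsspec) can write back the same way
    df.to_csv("gs://my_destination_bucket/combined.csv", index=False)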