I'm trying to deploy a training script on Google Cloud ML. I've uploaded my datasets (CSV files) to a bucket on GCS.
I used to import my data with pandas' read_csv, but it doesn't seem to work with a GCS path.
How should I proceed? (I would like to keep using pandas.)
import pandas as pd
data = pd.read_csv("gs://bucket/folder/file.csv")
Output:
ERROR 2018-02-01 18:43:34 +0100 master-replica-0 IOError: File gs://bucket/folder/file.csv does not exist
Answer 0: (score: 3)
You need to use file_io from tensorflow.python.lib.io to do this, as shown below:
from io import StringIO

from tensorflow.python.lib.io import file_io
import pandas as pd

# read the input data
def read_data(gcs_path):
    print('downloading csv file from', gcs_path)
    file_stream = file_io.FileIO(gcs_path, mode='r')
    data = pd.read_csv(StringIO(file_stream.read()))
    return data
Now call the function above:
df = read_data('gs://bucket/folder/file.csv')
# print(df.head()) # display top 5 rows including headers
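The FileIO-plus-StringIO pattern can be checked locally by feeding read_csv an in-memory CSV string in place of the GCS stream (the gs:// path above needs a real bucket and credentials; the column names here are made up for illustration):

```python
from io import StringIO

import pandas as pd

# Stand-in for file_stream.read(): the CSV text as it would come back from GCS.
csv_text = "x,y\n1,2\n3,4\n"

# Same call as in read_data(): wrap the text in StringIO so that
# pd.read_csv can treat it like a file object.
df = pd.read_csv(StringIO(csv_text))
print(df.shape)           # → (2, 2)
print(list(df.columns))   # → ['x', 'y']
```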
Answer 1: (score: 1)
Pandas has no native GCS support. There are two options:
1. Use the gsutil CLI to copy the file to the VM.
2. Use the TensorFlow file_io library to open the file and pass the file object to pd.read_csv(). See the detailed answer here.
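Option 1 is a one-time copy; a sketch, assuming gsutil is installed and authenticated on the VM (the bucket path is the one from the question):

```shell
# Copy the CSV from GCS onto the local filesystem once,
# then read it with pandas exactly as before.
gsutil cp gs://bucket/folder/file.csv ./file.csv
python -c "import pandas as pd; print(pd.read_csv('file.csv').head())"
```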
Answer 2: (score: 0)
You can also use Dask to extract the data and then load it into a Jupyter Notebook running on GCP.
Make sure Dask is installed.
conda install dask #conda
pip install dask[complete] #pip
import dask.dataframe as dd #Import
dataframe = dd.read_csv('gs://bucket/datafile.csv') #Read CSV data
dataframe2 = dd.read_csv('gs://bucket/path/*.csv') #Read multiple CSV files
That's all it takes to load the data.
You can now filter and process the data with pandas syntax.
dataframe['z'] = dataframe.x + dataframe.y  #Pandas-style column arithmetic, still lazy
dataframe_pd = dataframe.compute()  #Materialize the result as a pandas DataFrame