I am trying to read a csv file from a Google Cloud Storage bucket into a pandas dataframe.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from io import BytesIO
from google.cloud import storage
storage_client = storage.Client()
bucket = storage_client.get_bucket('createbucket123')
blob = bucket.blob('my.csv')
path = "gs://createbucket123/my.csv"
df = pd.read_csv(path)
It shows the following error message:
FileNotFoundError: File b'gs://createbucket123/my.csv' does not exist
What am I doing wrong? I can't find any solution that doesn't involve Google Datalab.
Answer 0 (score: 30)
As of pandas version 0.24, read_csv supports reading directly from Google Cloud Storage. Simply pass the URL to your bucket like this:
df = pd.read_csv('gs://bucket/your_path.csv')
For the sake of completeness, I'll also cover three other options below.
I have written some convenience functions for reading from Google Storage. To make them more readable I added type annotations. If you happen to be on Python 2, simply remove the annotations and the code will work just the same.
It works equally well on public and private datasets, assuming you are authorized. With this approach you do not need to download the data to your local drive first.
How to use it:
fileobj = get_byte_fileobj('my-project', 'my-bucket', 'my-path')
df = pd.read_csv(fileobj)
The code:
from io import BytesIO, StringIO
from google.cloud import storage
from google.oauth2 import service_account
def get_byte_fileobj(project: str,
                     bucket: str,
                     path: str,
                     service_account_credentials_path: str = None) -> BytesIO:
    """
    Retrieve data from a given blob on Google Storage and pass it as a file object.
    :param project: name of the project
    :param bucket: name of the bucket
    :param path: path within the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: file object (BytesIO)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    byte_stream.seek(0)
    return byte_stream
def get_bytestring(project: str,
                   bucket: str,
                   path: str,
                   service_account_credentials_path: str = None) -> bytes:
    """
    Retrieve data from a given blob on Google Storage and pass it as a byte-string.
    :param project: name of the project
    :param bucket: name of the bucket
    :param path: path within the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: byte-string (needs to be decoded)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    s = blob.download_as_string()
    return s
def _get_blob(bucket_name, path, project, service_account_credentials_path):
    credentials = service_account.Credentials.from_service_account_file(
        service_account_credentials_path) if service_account_credentials_path else None
    storage_client = storage.Client(project=project, credentials=credentials)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    return blob
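For completeness, here is a quick usage sketch for get_bytestring as well (the project, bucket, and path names below are placeholders); since it returns raw bytes, wrap them in a BytesIO before handing them to pandas:
raw_bytes = get_bytestring('my-project', 'my-bucket', 'my-path')
df = pd.read_csv(BytesIO(raw_bytes))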
gcsfs is a "Pythonic file system for Google Cloud Storage".
How to use it:
import pandas as pd
import gcsfs
fs = gcsfs.GCSFileSystem(project='my-project')
with fs.open('bucket/path.csv') as f:
    df = pd.read_csv(f)
Dask "provides advanced parallelism for analytics, enabling performance at scale for the tools you love". It is great when you need to deal with large volumes of data in Python. Dask tries to mimic much of the pandas API, making it easy to pick up for newcomers.
Here is how to use it with read_csv:
import dask.dataframe as dd
df = dd.read_csv('gs://bucket/data.csv')
df2 = dd.read_csv('gs://bucket/path/*.csv') # nice!
# df is now Dask dataframe, ready for distributed processing
# If you want to have the pandas version, simply:
df_pd = df.compute()
Answer 1 (score: 12)
Another option is to use TensorFlow, which can stream reads from Google Cloud Storage:
import pandas as pd
from tensorflow.python.lib.io import file_io

with file_io.FileIO('gs://bucket/file.csv', 'r') as f:
    df = pd.read_csv(f)
Using TensorFlow also gives you a convenient way to handle wildcards in the filename. For example, the following code reads all CSVs matching a specific pattern (e.g. gs://bucket/some/dir/train-*) into a pandas dataframe:
import os

import tensorflow as tf
from tensorflow.python.lib.io import file_io
import pandas as pd

def read_csv_file(filename):
    with file_io.FileIO(filename, 'r') as f:
        df = pd.read_csv(f, header=None, names=['col1', 'col2'])
    return df

def read_csv_files(filename_pattern):
    filenames = tf.gfile.Glob(filename_pattern)
    dataframes = [read_csv_file(filename) for filename in filenames]
    return pd.concat(dataframes)

DATADIR = 'gs://my-bucket/some/dir'
traindf = read_csv_files(os.path.join(DATADIR, 'train-*'))
evaldf = read_csv_files(os.path.join(DATADIR, 'eval-*'))
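Note that in newer TensorFlow releases (2.x) these helpers have moved; a roughly equivalent sketch, assuming TF 2.x, uses tf.io.gfile in place of tf.gfile/file_io (column names and bucket path are placeholders as above):
import os
import pandas as pd
import tensorflow as tf

def read_csv_files_tf2(filename_pattern):
    # tf.io.gfile replaces the older tf.gfile / file_io helpers in TF 2.x
    filenames = tf.io.gfile.glob(filename_pattern)
    dataframes = [pd.read_csv(tf.io.gfile.GFile(fn, 'r'), header=None, names=['col1', 'col2'])
                  for fn in filenames]
    return pd.concat(dataframes)

traindf = read_csv_files_tf2(os.path.join('gs://my-bucket/some/dir', 'train-*'))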
Answer 2 (score: 3)
read_csv does not support gs://. From the documentation:
The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.csv
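(This answer predates the native gs:// support described above.) On such older pandas versions, a minimal workaround is to download the blob yourself with the google-cloud-storage client from the question and hand pandas a file-like object; a sketch, reusing the bucket and file names from the question:
from io import BytesIO
import pandas as pd
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('createbucket123')
blob = bucket.blob('my.csv')
df = pd.read_csv(BytesIO(blob.download_as_string()))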
Answer 3 (score: 3)
I was looking at this question and didn't want to go through the hassle of installing another library, gcsfs, whose documentation literally says This software is beta, use at your own risk... but I found a good workaround that I wanted to post here in case it helps anyone else, using just the google.cloud storage library and some native Python libraries. Here is the function:
import pandas as pd
from google.cloud import storage
import os
import io
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/to/creds.json'
def gcp_csv_to_df(bucket_name, source_file_name):
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_file_name)
    data = blob.download_as_string()
    df = pd.read_csv(io.BytesIO(data))
    print(f'Pulled down file from bucket {bucket_name}, file name: {source_file_name}')
    return df
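A usage sketch (the bucket and file names are placeholders):
df = gcp_csv_to_df('my-bucket', 'my.csv')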
Additionally, although it is outside the scope of this question, if you would like to upload a pandas dataframe to GCP with a similar function, here is the code to do so:
def df_to_gcp_csv(df, dest_bucket_name, dest_file_name):
    storage_client = storage.Client()
    bucket = storage_client.bucket(dest_bucket_name)
    blob = bucket.blob(dest_file_name)
    blob.upload_from_string(df.to_csv(), 'text/csv')
    print(f'DataFrame uploaded to bucket {dest_bucket_name}, file name: {dest_file_name}')
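And a matching usage sketch for the upload (again with placeholder names):
df_to_gcp_csv(df, 'my-bucket', 'output.csv')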
Hope this helps! I know I will definitely be using these functions.
Answer 4 (score: 2)
As of pandas==0.24.0 this is supported natively if you have gcsfs installed: https://github.com/pandas-dev/pandas/pull/22704.
Until the official release, you can try it out with pip install pandas==0.24.0rc1.
Answer 5 (score: 1)
There are three ways of accessing files in GCS:
Answer 6 (score: 1)
If I understood your question correctly, then this link may help you get a better URL for your read_csv() function:
Answer 7 (score: 0)
You still need to import gcsfs if you are loading compressed files.
I tried pd.read_csv('gs://your-bucket/path/data.csv.gz') on pandas version => 0.25.3 and got the following error:
/opt/conda/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
438 # See https://github.com/python/mypy/issues/1297
439 fp_or_buf, _, compression, should_close = get_filepath_or_buffer(
--> 440 filepath_or_buffer, encoding, compression
441 )
442 kwds["compression"] = compression
/opt/conda/anaconda/lib/python3.6/site-packages/pandas/io/common.py in get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode)
211
212 if is_gcs_url(filepath_or_buffer):
--> 213 from pandas.io import gcs
214
215 return gcs.get_filepath_or_buffer(
/opt/conda/anaconda/lib/python3.6/site-packages/pandas/io/gcs.py in <module>
3
4 gcsfs = import_optional_dependency(
----> 5 "gcsfs", extra="The gcsfs library is required to handle GCS files"
6 )
7
/opt/conda/anaconda/lib/python3.6/site-packages/pandas/compat/_optional.py in import_optional_dependency(name, extra, raise_on_missing, on_version)
91 except ImportError:
92 if raise_on_missing:
---> 93 raise ImportError(message.format(name=name, extra=extra)) from None
94 else:
95 return None
ImportError: Missing optional dependency 'gcsfs'. The gcsfs library is required to handle GCS files Use pip or conda to install gcsfs.
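As the traceback suggests, installing gcsfs (pip install gcsfs) resolves this; with gcsfs available, the same call should then work, since pandas infers gzip compression from the .gz extension (the bucket path is a placeholder):
import pandas as pd
df = pd.read_csv('gs://your-bucket/path/data.csv.gz')  # requires gcsfs to be installed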
Answer 8 (score: 0)
Since pandas 1.2, loading files from Google storage into a DataFrame has become very easy.
If you work on your local machine, it looks like this:
df = pd.read_csv('gcs://your-bucket/path/data.csv.gz',
                 storage_options={"token": "credentials.json"})
The important part is that you pass the credentials.json file from Google as the token.
If you work on Google Cloud, do this:
df = pd.read_csv('gcs://your-bucket/path/data.csv.gz',
                 storage_options={"token": "cloud"})