How to import data from Google Cloud Storage into Google Colab

Time: 2018-08-06 20:34:33

Tags: google-cloud-storage google-colaboratory

I am currently working with a 10 GB dataset. I have uploaded it to Google Cloud Storage, but I don't know how to import it into Google Colab.

3 answers:

Answer 0: (score: 12)

from google.colab import auth
auth.authenticate_user()

Running this generates a link; click it and complete the sign-in.

!echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" > /etc/apt/sources.list.d/gcsfuse.list
!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
!apt -qq update
!apt -qq install gcsfuse

Use the commands above to install gcsfuse on Colab. Cloud Storage FUSE is an open-source FUSE adapter that lets you mount Cloud Storage buckets as file systems on Colab, Linux, or macOS.

!mkdir folderOnColab
!gcsfuse folderOnBucket/content/ folderOnColab

Use the commands above to mount the directory. (folderOnBucket is the GCS bucket URL without the gs:// prefix.)
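If you only have the full gs:// URL at hand, the bucket name and object path can be separated with a few lines of Python. This is a minimal sketch; the function name split_gs_url and the example URL are made up for illustration and are not part of gcsfuse or any Google library.

```python
def split_gs_url(url):
    """Split 'gs://bucket/path/to/obj' into ('bucket', 'path/to/obj')."""
    prefix = "gs://"
    if not url.startswith(prefix):
        raise ValueError(f"not a gs:// URL: {url!r}")
    # Everything up to the first "/" is the bucket; the rest is the object path.
    bucket, _, path = url[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name in {url!r}")
    return bucket, path


# Hypothetical URL, used only to show the call.
bucket, path = split_gs_url("gs://my-example-bucket/content/data.csv")
print(bucket)  # my-example-bucket
print(path)    # content/data.csv
```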

See the documentation for further reading: https://cloud.google.com/storage/docs/gcs-fuse
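After mounting, the bucket contents should behave like ordinary files, so a quick sanity check is to count the files and total bytes under the mount point. The sketch below assumes the mount point is the folderOnColab directory from this answer; summarize_folder is a hypothetical helper name. If subfolders appear to be missing, it may be because gcsfuse only shows directories that exist as explicit objects unless it is started with the --implicit-dirs flag.

```python
from pathlib import Path


def summarize_folder(folder):
    """Return (file_count, total_bytes) for all regular files under folder."""
    files = [p for p in Path(folder).rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


# Only runs the check if the mount point from the answer actually exists.
if Path("folderOnColab").is_dir():
    count, size = summarize_folder("folderOnColab")
    print(f"{count} files, {size / 1e9:.2f} GB")
```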

Answer 1: (score: 2)

The documentation covers this in External data: Drive, Sheets, and Cloud Storage...

There is even an Importing data using the Cloud Storage Python API code snippet.
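For readers who do not have the notebook open, here is a rough sketch of what a download through the Cloud Storage Python API can look like. It is not the snippet from the documentation; download_object is a hypothetical helper, and the project, bucket, and object names in the comments are placeholders. The client is passed in so that the helper itself does not depend on how you authenticate.

```python
def download_object(client, bucket_name, object_name, local_path):
    """Download one object from a bucket to a local file and return its path."""
    # client.bucket() builds a bucket handle without an extra API request.
    blob = client.bucket(bucket_name).blob(object_name)
    blob.download_to_filename(local_path)
    return local_path


# In Colab, after authenticating, the call might look like this
# (all names below are placeholders, not values from the question):
#
#   from google.colab import auth
#   auth.authenticate_user()
#   from google.cloud import storage
#   client = storage.Client(project="my-project-id")
#   download_object(client, "my-example-bucket", "content/data.csv", "/tmp/data.csv")
```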

Answer 2: (score: 2)

Using a dedicated service account and Python:

from google.oauth2 import service_account
from google.cloud.storage import client
import io
import pandas as pd
from io import BytesIO
import json
import filecmp

Paste the service account key (the contents of the JSON key file) as a str:

SERVICE_ACCOUNT = json.loads(r"""{
  "type": "service_account",
  "project_id": "[REPLACE WITH YOUR FILE]",
  "private_key_id": "[REPLACE WITH YOUR FILE]",
  "private_key": "[REPLACE WITH YOUR FILE]",
  "client_email": "[REPLACE WITH YOUR FILE]",
  "client_id": "[REPLACE WITH YOUR FILE]",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "[REPLACE WITH YOUR FILE]"
}""")

BUCKET = "[NAME OF YOUR BUCKET TO READ/WRITE YOUR DATA]"

Create the client from the service account credentials:

credentials = service_account.Credentials.from_service_account_info(
    SERVICE_ACCOUNT,
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

client = client.Client(
    credentials=credentials,
    project=credentials.project_id,
)

Upload and download functions:

def save_file(local_filename, remote_filename):
    bucket = client.get_bucket(BUCKET)
    blob = bucket.blob(remote_filename)
    blob.upload_from_filename(local_filename)

def download_file(local_filename, remote_filename):
    bucket = client.get_bucket(BUCKET)
    blob = bucket.blob(remote_filename)
    blob.download_to_filename(local_filename)
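A dataset of several gigabytes is often split across many objects. As a possible extension of the functions above, the sketch below downloads every object under a prefix. download_prefix is a hypothetical helper that takes the storage client as an argument; it relies on the client's list_blobs method and on each blob's name attribute.

```python
import os


def download_prefix(client, bucket_name, prefix, local_dir):
    """Download every object under prefix into local_dir; return local paths."""
    downloaded = []
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        if blob.name.endswith("/"):
            continue  # skip zero-byte "folder" placeholder objects
        target = os.path.join(local_dir, blob.name)
        # Recreate the object's folder structure locally before writing.
        os.makedirs(os.path.dirname(target), exist_ok=True)
        blob.download_to_filename(target)
        downloaded.append(target)
    return downloaded
```

With the client and BUCKET defined in this answer, a call could look like download_prefix(client, BUCKET, "content/", "/tmp/data"), where the prefix and target folder are placeholders.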

Let's verify this with a CSV file generated by Pandas:

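# Build the DataFrame first; to_csv returns None when given a path,
# so chaining it would leave df_test set to None.
df_test = pd.DataFrame(
    {"col1": [1, 2, 3],
     "col2": [4, 5, 6]}
)
df_test.to_csv(path_or_buf="/tmp/test.csv")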

save_file("/tmp/test.csv","test.csv")
download_file("/tmp/test2.csv","test.csv")
assert filecmp.cmp('/tmp/test.csv', '/tmp/test2.csv')
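The round trip above goes through local files. For small objects it may be simpler to read the contents straight into memory. The sketch below assumes a recent google-cloud-storage release, where blobs provide download_as_bytes (older releases offered download_as_string instead); read_csv_rows is a hypothetical helper, and it uses the standard csv module so it runs without extra dependencies. This is not a good fit for a multi-gigabyte file, which is better downloaded to disk first.

```python
import csv
import io


def read_csv_rows(blob):
    """Download a CSV object into memory and return its rows as dicts."""
    data = blob.download_as_bytes()
    text = data.decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


# With pandas, the same bytes could be parsed with
# pd.read_csv(io.BytesIO(data)) instead of csv.DictReader.
```

With the client from this answer, the blob could be obtained with client.get_bucket(BUCKET).blob("test.csv").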