Question

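I have a REST API that I want to call using Python and requests. The output is delivered as JSON. One option is to write the results to Google Cloud Storage and then schedule a BigQuery job to run on top of them.

Alternatively, and this is my preferred route so far, I would like to run pandas over them and just write out a CSV that can be displayed in Google Data Studio.

Can anyone suggest the best architectural approach? Which Google services should I be looking at?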

Answer 1

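If the CSV (or JSON) file that you create from the API responses and upload to Cloud Storage is not very big (on the order of KB, maybe a few MB, but never more than the memory offered by Cloud Functions), you can use Cloud Functions. If it is a huge file in the GB range, I would consider a Compute Engine instance or perhaps my own computer. I say this because in the Cloud Functions runtime you can only write to the "/tmp" directory, which corresponds to a volume held in memory, so the file consumes the memory provisioned for the function (from 128MB to 2048MB, with the price increasing accordingly). Using a Compute Engine instance or your own computer as temporary storage to create the file, and then uploading it to Cloud Storage for permanent storage, gives you much more freedom.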
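To make the memory constraint concrete, here is a minimal sketch that serializes some rows to CSV in memory and checks the size against a function's memory setting. The helper names (csv_size_bytes, fits_in_function) and the 50% headroom figure are assumptions made for illustration; they are not part of any Google API, and the right headroom depends on how much memory the rest of your function uses.

```python
import csv
import io


def csv_size_bytes(rows, fieldnames):
    """Serialize rows (a list of dicts) to CSV in memory and return its size in bytes."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return len(buffer.getvalue().encode("utf-8"))


def fits_in_function(size_bytes, memory_mb=256, headroom=0.5):
    """Return True if a file of size_bytes fits in the share of memory left for /tmp.

    headroom is an assumed fraction of the configured memory that the file
    may occupy; the remainder is left for the Python process itself.
    """
    budget = memory_mb * 1024 * 1024 * headroom
    return size_bytes <= budget


# Hypothetical sample rows standing in for the API responses.
rows = [{"num": i, "title": f"comic {i}"} for i in range(1, 11)]
size = csv_size_bytes(rows, ["num", "title"])
print(size, fits_in_function(size))
```

A file of a few KB passes easily, while a multi-GB file fails even at the 2048MB setting, which is the case where a Compute Engine instance is the better fit.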

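The sample code in the steps below can serve as a base: it queries some JSON content with the Python requests module and creates a CSV file from a pandas dataframe.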

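Create a bucket, preferably with uniform access control rather than fine-grained, and note the bucket name (you will set it later in the main.py file described below).
By default, Cloud Functions uses the App Engine default service account (which has the Editor role assigned), but make sure that the account you use for Cloud Functions has at least the Storage Object Creator role.
Create a new directory (for example my-cool-cf) and add the following files: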

a. requirements.txt

google-cloud-storage
pandas
requests

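b. main.py (adapt this file to the requests your specific API needs; mine simply fetches JSON input from the xkcd comics API. If the API requires some form of authentication, consider using environment variables or Cloud KMS to handle sensitive information)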

from google.cloud import storage
import requests
import pandas as pd
import os
from os import path


def upload_blob(bucket_name, source_file_name, destination_blob_name):
    """Uploads a file to the bucket."""
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(source_file_name)
    print(
        "File {} uploaded to {}.".format(
            source_file_name, destination_blob_name
        )
    )

def make_request(request):
    # Change the URL according to your API endpoint.
    url = "https://xkcd.com/"
    records = []
    # Fetch comics 1 through 10 and keep each JSON payload as one row.
    for i in range(1, 11):
        response = requests.get(url + str(i) + "/info.0.json")
        response.raise_for_status()
        records.append(response.json())
    # Build the dataframe in one step; the columns come from the JSON keys.
    df = pd.DataFrame(records)
    print(df)
    # Create a temporary CSV file ("/tmp" is the only writable directory).
    filename = "temp.csv"
    full_path = path.join("/tmp", filename)
    df.to_csv(full_path)
    # Upload it to Cloud Storage.
    bucket_name = "[YOUR-BUCKET-NAME]"  # CHANGE ME
    destination_blob_name = "my_csv_file_1.csv"  # CHANGE ME
    upload_blob(bucket_name, full_path, destination_blob_name)
    # Remove the temp file once it is uploaded: "/tmp" lives in memory,
    # so leftover files can cause OOM issues in the Cloud Function.
    os.remove(full_path)

    return "CSV from Dataframe Uploaded to Cloud Storage"
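If your API needs authentication, as mentioned for main.py, one possible approach is to read the credential from an environment variable instead of hard-coding it. The sketch below is only an illustration: the variable name MY_API_TOKEN, the helper build_auth_headers, and the Bearer scheme are all assumptions, so check what your API actually expects.

```python
import os


def build_auth_headers(env_var="MY_API_TOKEN"):
    """Build request headers from a token stored in an environment variable.

    MY_API_TOKEN is a hypothetical variable name, and the Bearer scheme is
    an assumption; adjust both to match your API.
    """
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return {"Authorization": f"Bearer {token}"}
```

Inside make_request you could then call requests.get(url, headers=build_auth_headers()). The variable itself can be supplied at deploy time with the --set-env-vars flag of gcloud functions deploy.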

Assuming you have the Cloud SDK installed, you can deploy the function by issuing the following command:

gcloud functions deploy [YOUR-COOL-FUNCTION-NAME] --trigger-http --runtime python37 --entry-point make_request --timeout 540s

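Once the CSV file has been created and uploaded to Cloud Storage, it can be visualized with Data Studio. If you want to run a BigQuery job instead, you can create an additional Cloud Function to do so, but this one should be triggered by the google.storage.object.finalize Storage trigger, so that the function automatically starts the BigQuery job as soon as the file finishes uploading to your bucket. The Cloud Functions documentation on Cloud Storage triggers has all the relevant information.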
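A minimal sketch of such a storage-triggered function might look like the following. It assumes a first-generation background function (the event, context signature), that google-cloud-bigquery is added to requirements.txt, and a hypothetical destination table my-project.my_dataset.my_table; the helper gcs_uri is also just an illustrative name.

```python
def gcs_uri(event):
    """Build the gs:// URI of the object described by a Cloud Storage event."""
    return f"gs://{event['bucket']}/{event['name']}"


def load_csv_to_bigquery(event, context):
    """Background function: load the newly finalized CSV into BigQuery."""
    # Imported here so that gcs_uri can be exercised without the client
    # library installed; in a deployed function a top-level import is fine.
    from google.cloud import bigquery

    table_id = "my-project.my_dataset.my_table"  # hypothetical; CHANGE ME
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the CSV header row
        autodetect=True,  # let BigQuery infer the schema
    )
    load_job = client.load_table_from_uri(
        gcs_uri(event), table_id, job_config=job_config
    )
    load_job.result()  # wait for the load job to finish
    print("Loaded {} into {}.".format(gcs_uri(event), table_id))
```

Such a function would be deployed with --trigger-resource set to your bucket name and --trigger-event google.storage.object.finalize in place of --trigger-http. Note that the CSV written by the sample main.py includes the dataframe index as an unnamed first column; passing index=False to to_csv avoids that before loading.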

Requests and pandas in Google Cloud