I have a REST API that I want to call using Python and requests. The output comes back as JSON. One option is to write the JSON to Google Cloud Storage and then schedule a BigQuery job to run on top of it.
Alternatively, and this is my preferred route so far, I would like to run pandas over the responses and just write out a CSV that can be displayed in Google Data Studio.
Can someone suggest the best architectural approach? Which Google services should I be looking at?
Answer 0: (score: 0)
If the CSV (or JSON) file that you want to create, by making the relevant requests to the API and then uploading the result to Cloud Storage, is not very large (on the order of KBs, maybe a few MBs, but never exceeding the RAM offered by Cloud Functions), you can use Cloud Functions. If it is a huge, GB-sized file, I would consider a Compute Engine instance or perhaps my own machine instead. The reason is that within the Cloud Functions runtime you can only write to the "/tmp" directory, which is an in-memory volume, so it consumes the memory resources allocated to the function (128 MB to 2048 MB, with the price increasing accordingly). Using a Compute Engine instance or your own machine as temporary storage to create the file, and then uploading it to Cloud Storage for permanent storage, gives you much more freedom.
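As a side note, if the data comfortably fits in memory anyway, you could skip "/tmp" altogether and upload the CSV straight from memory with upload_from_string. The snippet below is only a minimal sketch of that alternative; the function name, bucket name and object name are placeholders, not values from this answer.

from google.cloud import storage
import pandas as pd

def upload_dataframe_as_csv(df, bucket_name, blob_name):
    """Sketch: upload a Pandas dataframe as CSV directly from memory, without touching /tmp."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    # to_csv() without a path returns the CSV content as a string.
    blob.upload_from_string(df.to_csv(index=False), content_type="text/csv")

# Example call (placeholder names):
# upload_dataframe_as_csv(df, "my-bucket", "report.csv")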
The sample code below can serve as a basis: it creates a CSV file from a Pandas dataframe populated with JSON content queried with the Python requests module.
a. requirements.txt
google-cloud-storage
pandas
requests
b. main.py (adapt this file to the relevant requests for your specific API; mine simply fetches JSON input from the xkcd comics API. If the API requires some kind of authentication, consider using environment variables or a secret-management mechanism to handle the sensitive information.)
from google.cloud import storage
import requests
import pandas as pd
import os
from os import path

def upload_blob(bucket_name, source_file_name, destination_blob_name):
    """Uploads a file to the bucket."""
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(source_file_name)
    print(
        "File {} uploaded to {}.".format(
            source_file_name, destination_blob_name
        )
    )

def make_request(request):
    # Change the URL according to your API endpoint.
    url = 'https://xkcd.com/'
    # Process the requests and turn the responses into a dataframe.
    for i in range(1, 11):
        if i == 1:
            response = requests.get(url + str(i) + "/info.0.json")
            response.raise_for_status()
            # Create the dataframe columns based on the JSON keys.
            columns = response.json().keys()
            df = pd.DataFrame(columns=columns)
            df.loc[len(df)] = response.json()
        else:
            response = requests.get(url + str(i) + "/info.0.json")
            response.raise_for_status()
            # Append the new row to the dataframe.
            df.loc[len(df)] = response.json()
    print(df)
    # Create a temporary CSV file under /tmp (an in-memory volume in Cloud Functions).
    filename = "temp.csv"
    full_path = path.join("/tmp", filename)
    df.to_csv(full_path)
    # Upload it to Cloud Storage.
    bucket_name = "[YOUR-BUCKET-NAME]"  # CHANGE ME
    source_file_name = full_path
    destination_blob_name = "my_csv_file_1.csv"  # CHANGE ME
    upload_blob(bucket_name, source_file_name, destination_blob_name)
    # Remove the temp file after it is uploaded to Cloud Storage to avoid OOM issues with the Cloud Function.
    os.remove(full_path)
    return "CSV from Dataframe Uploaded to Cloud Storage"
c. Deploy the function with the following command (requires the Cloud SDK installed):
gcloud functions deploy [YOUR-COOL-FUNCTION-NAME] --trigger-http --runtime python37 --entry-point make_request --timeout 540s
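Once deployed, gcloud prints the function's httpsTrigger URL; sending an HTTP request to that URL (manually, with curl, or on a schedule via Cloud Scheduler) runs make_request and writes the CSV to your bucket.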
Since the CSV file is created and then uploaded to Cloud Storage, you can use Data Studio for the visualization. If you would rather run a BigQuery job, you can create an additional Cloud Function for that task, but this time it should be fired by a google.storage.object.finalize Storage trigger, so that the Function automatically starts the BigQuery job as soon as the file finishes uploading to your bucket. You can find all the relevant information about Storage triggers in the Cloud Functions documentation.
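For completeness, a minimal sketch of that second, Storage-triggered function might look like the following. It assumes a BigQuery dataset and table that you have already created (my_dataset.my_table is a placeholder), and it loads the finalized CSV object into that table with a load job; treat it as an illustration rather than a drop-in implementation.

# requirements.txt for this second function would need: google-cloud-bigquery
from google.cloud import bigquery

def load_csv_to_bigquery(event, context):
    """Background Cloud Function fired by google.storage.object.finalize.
    'event' carries the Cloud Storage object metadata (bucket and name)."""
    uri = "gs://{}/{}".format(event["bucket"], event["name"])
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,   # skip the header row written by df.to_csv()
        autodetect=True,       # let BigQuery infer the schema
    )
    # "my_dataset.my_table" is a placeholder destination table.
    load_job = client.load_table_from_uri(uri, "my_dataset.my_table", job_config=job_config)
    load_job.result()  # wait for the load job to complete
    print("Loaded {} into my_dataset.my_table".format(uri))

Such a function would be deployed with --trigger-event google.storage.object.finalize --trigger-resource [YOUR-BUCKET-NAME] instead of --trigger-http.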