Can anyone help with this?
I wrote an Azure Function that pulls data from BigQuery streamed in from Firebase (Google Analytics 4 for Firebase). Data streams from Google Analytics into the events_intraday_YYYYMMDD table every minute, and the function pulls the newly received data from BigQuery every 5 minutes, fetches the results, and stores them as a JSON list in an Azure Blob. The standard algorithm would be to run select * from {dataset} where event_timestamp > {lastRuntime}, roughly as sketched below.
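For reference, that "standard" incremental pull would look roughly like this sketch; lastRuntime is assumed to be a microsecond epoch that I would persist somewhere between runs (that bookkeeping is not shown, and the names are illustrative):

# Hypothetical sketch of the WHERE-based incremental pull.
# lastRuntime: microsecond epoch of the previous run, persisted between runs.
from google.cloud import bigquery

def pull_new_events(bqClient: bigquery.Client, datasetName: str, lastRuntime: int):
    # event_timestamp in the GA4 export is a microsecond epoch integer
    sql = f"select to_json_string(t) from `{datasetName}` t where t.event_timestamp > {lastRuntime}"
    return [row[0] for row in bqClient.query(sql).result()]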
The problem with that approach is that BigQuery charges for the bytes scanned even with the WHERE clause, which is not cost-effective: if I run the query 10 times a day and the table accumulates 1000 MB over the full day, the same bytes keep getting re-scanned cumulatively, so I would be billed for roughly 5500 MB of scanning (100 + 200 + ... + 1000). So to get at least an O(N) algorithm, I have to delete the data after each run (which is not a problem, since the table is deleted at the end of the day anyway). I admit this is not the best code, but here is my first attempt:
from google.cloud import bigquery
from google.oauth2 import service_account
import os, datetime, json, time, gzip, re
import azure.storage.blob as azsb

bigquerycreds = json.loads(os.environ['bigquerycreds'])
conn_storage = os.environ['conn_storage']

def uploadJsonGzipBlobBytes(filePathAndName, jsonBody, storageContainerName):
    blob = azsb.BlobClient.from_connection_string(
        conn_str=conn_storage,
        container_name=storageContainerName,
        blob_name=filePathAndName
    )
    blob.upload_blob(jsonBody)

def batch(game_name):
    start = time.perf_counter()
    lowerFormat = game_name.lower().replace(" ", "_")
    azFormat = re.sub(r'[^0-9a-zA-Z]+', '-', game_name).lower()
    storageContainerName = azFormat
    credentials = service_account.Credentials.from_service_account_info(bigquerycreds)
    bqClient = bigquery.Client(credentials=credentials)
    triggerTimestamp = datetime.datetime.utcnow()
    epochTriggerTimestamp = int(triggerTimestamp.timestamp() * 1000000)
    projectName = bqClient.project
    databaseName = [x.dataset_id for x in list(bqClient.list_datasets()) if 'analytics_' in x.dataset_id][0]
    tableName = f"events_intraday_{triggerTimestamp:%Y%m%d}"
    datasetName = f"{projectName}.{databaseName}.{tableName}"
    tempDatasetName = f"{projectName}.{databaseName}.tmp"

    # Copy current data into a temp table and truncate the main table
    print(f"Copying data from {datasetName} to {tempDatasetName} in BigQuery")
    sql = f'''\
    create or replace table `{tempDatasetName}` as
    select * from `{datasetName}`;
    truncate table `{datasetName}`;'''
    query_job = bqClient.query(sql)
    results = query_job.result()

    # Get the data from the temp table as JSON strings
    print(f"Getting data from {tempDatasetName}")
    sql = f'''\
    select to_json_string(t) from `{tempDatasetName}` t
    '''
    query_job = bqClient.query(sql)
    results = list(query_job.result().to_arrow().to_pydict().values())[0]
    resultCount = len(results)

    if resultCount != 0:
        print(f"Compressing data returned from {tempDatasetName}")
        dataBytes = gzip.compress(bytes("[" + ",".join(results) + "]", encoding='utf-8'))
        # Upload to Azure Blobs
        outFile = f"bigquery/live/{triggerTimestamp:%Y/%m/%d/%H}/{lowerFormat}_live_{triggerTimestamp:%Y%m%d%H%M%S}_{datetime.datetime.utcnow():%Y%m%d%H%M%S}.json.gz"
        print(f"Commencing to upload data to blob -- {round(time.perf_counter()-start, 2)} sec")
        uploadJsonGzipBlobBytes(outFile, dataBytes, storageContainerName)
        print(f"File compiled: {outFile} -- {resultCount} rows -- Process Time: {round(time.perf_counter()-start, 2)} sec\n")
    else:
        print(f"No new data to upload")

    # Drop the temp table so the next run starts from an empty copy
    print(f"Dropping table {tempDatasetName}")
    sql = f'''\
    drop table `{tempDatasetName}`
    '''
    query_job = bqClient.query(sql)
    results = query_job.result()
The code performs the following steps:
1. Copy the current contents of events_intraday_YYYYMMDD into a temp table (tmp) and truncate the main table.
2. Read the temp table back as JSON strings (to_json_string).
3. Gzip the results and upload them as a JSON list to an Azure Blob.
4. Drop the temp table.
On the next run I should then only get new rows, so in the same example as above the total bytes I scan would equal what the table accumulates over the day (1000 MB in the example); a rough comparison of the arithmetic is below.
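Just to make the cost arithmetic explicit (the numbers are the illustrative ones from my example above, not actual measurements):

# WHERE-only re-scans everything accumulated so far on every run, while
# copy-then-truncate only ever scans each byte once.
runs = 10
daily_mb = 1000
where_only_mb = sum(daily_mb * i // runs for i in range(1, runs + 1))  # 100 + 200 + ... + 1000
truncate_mb = daily_mb                                                 # each byte scanned once
print(where_only_mb, truncate_mb)  # 5500 1000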
However, something seems to be wrong. I ran a gap analysis between the number of rows in my blobs and the number of rows in events_YYYYMMDD (the table compiled by the daily export, available the next day), and I seem to be missing about 90% of my data. I also noticed that once I run the TRUNCATE statement, it takes a while before data starts being inserted into the events_intraday_YYYYMMDD table again, so I'm not sure what is going on.
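For context, the gap analysis is roughly the following kind of check (names, paths and the container wiring are illustrative, not the exact script I ran):

# Sketch: compare the daily export's row count with the rows that landed in the blobs.
import gzip, json
from azure.storage.blob import ContainerClient
from google.cloud import bigquery

def bigquery_row_count(bqClient: bigquery.Client, projectName: str, databaseName: str, day: str):
    # day as 'YYYYMMDD'; counts rows in the compiled daily export table
    sql = f"select count(*) from `{projectName}.{databaseName}.events_{day}`"
    return list(bqClient.query(sql).result())[0][0]

def blob_row_count(containerClient: ContainerClient, prefix: str):
    # prefix like 'bigquery/live/2021/05/01/'; sums rows across the gzipped JSON lists
    total = 0
    for blob in containerClient.list_blobs(name_starts_with=prefix):
        data = containerClient.download_blob(blob.name).readall()
        total += len(json.loads(gzip.decompress(data)))
    return total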