目前,我正在使用Google Big查询来匹配两个表,并将查询结果导出为GCS中的CSV文件。
我找不到通过API直接将查询结果导出为GCS中的CSV的方法,而谷歌云控制台则可以随时使用。
所以我使用与查询输出相同的列名和类型动态创建表,并在表中插入记录。但是,只要我这样做,就会在实际表填充数据之前创建Streaming缓冲区。大约需要90分钟才能成熟。
如果我在GCS中将该表(带有活动流缓冲区)导出为CSV,有时导出的文件只包含列标题,并且不会导出行数据。
有没有办法克服这种情况?
我正在使用google cloud php API代码。
以下是示例工作代码。
$bigQuery = new BigQueryClient();
try{
$res = $bigQuery->dataset(DATASET)->createTable($owner_table, [
'schema' => [
'fields' => $ownerFields
]
]);
}
catch (Exception $e){
//echo 'Message: ' .$e->getMessage();
//die();
$custom_error_message = "Error while creating owner table schema. Trying to create with ".$ownerFields;
sendErrorEmail($e->getMessage(), $custom_error_message);
return false;
}
$bigQuery = new BigQueryClient([
'projectId' => $projectId,
]);
$dataset = $bigQuery->dataset($datasetId);
$table = $dataset->table($tableId);
// load the storage object
$storage = new StorageClient([
'projectId' => $projectId,
]);
$object = $storage->bucket($bucketName)->object($objectName);
// create the import job
$job = $table->loadFromStorage($object, $options);
// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
//print('Waiting for job to complete' . PHP_EOL);
$job->reload();
if (!$job->isComplete()) {
//throw new Exception('Job has not yet completed', 500);
// sendErrorEmail("Job has not yet completed", "Error while import from storage.");
}
});
{
//Code omitted for brevity
}
$queryResults = $job->queryResults();
if ($queryResults->isComplete()) {
$i = 0;
$rows = $queryResults->rows();
$dataRows = [];
foreach ($rows as $row) {
//pr($row); die();
// printf('--- Row %s ---' . PHP_EOL, ++$i);
++$i;
$inlineRow = [];
$inlineRow['insertId'] = $i;
foreach ($row as $column => $value) {
// printf('%s: %s' . PHP_EOL, $column, $value);
$arr[$column] = $value;
}
$inlineRow['data']= $arr;
$dataRows[] = $inlineRow;
}
/* Create a new result table to store the query result*/
if(createResultTable($schema,$type))
{
if(count($dataRows) > 0)
{
sleep(120); //force sleep to mature the result table
$ownerTable = $bigQuery->dataset(DATASET)->table($tableId);
$insertResponse = $ownerTable->insertRows($dataRows);
if (!$insertResponse->isSuccessful()) {
//print_r($insertResponse->failedRows());
sendErrorEmail($insertResponse->failedRows(),"Failed to create resulted table for".$type);
return 0;
}
else
{
return 1;
}
}
else
{
return 2;
}
}
else
{
return 0;
}
//printf('Found %s row(s)' . PHP_EOL, $i);
}
function export_table($projectId, $datasetId, $tableId, $bucketName, $objectName, $format = 'CSV')
{
$bigQuery = new BigQueryClient([
'projectId' => $projectId,
]);
$dataset = $bigQuery->dataset($datasetId);
$table = $dataset->table($tableId);
// load the storage object
$storage = new StorageClient([
'projectId' => $projectId,
]);
$destinationObject = $storage->bucket($bucketName)->object($objectName);
// create the import job
$options = ['jobConfig' => ['destinationFormat' => $format]];
$job = $table->export($destinationObject, $options);
// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
//print('Waiting for job to complete' . PHP_EOL);
$job->reload();
if (!$job->isComplete()) {
//throw new Exception('Job has not yet completed', 500);
return false;
}
});
// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
$error = $job->info()['status']['errorResult']['message'];
//printf('Error running job: %s' . PHP_EOL, $error);
sendErrorEmail($job->info()['status'], "Error while exporting resulted table. File Name: ".$objectName);
return false;
} else {
return true;
}
}
问题出现在#5上。由于#4中的第3个表位于流缓冲区中,因此导出作业无法成功运行。
有时它能够正确导出表格,有时候却没有完成。我们试图在#4和#5之间休息约120秒,但问题仍然存在。
我已经看到通过API动态生成的表在流缓冲区中持续90分钟。但这太高了。