Question

I am trying to automate some data cleaning tasks by uploading files to Cloud Storage, running them through a pipeline, and downloading the results.

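I created the pipeline template using the GUI in Dataprep, and I am now trying to automate uploading the files and executing the template with the Google client libraries, specifically in Python.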

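However, I have found that when I run the job from the Python script, the full template is not executed. Sometimes some of the steps are not completed, and sometimes the output file, which should be megabytes in size, is less than 500 bytes. The symptom depends on the template I use; each template has its own issue.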

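I tried breaking the large template into smaller templates and applying them consecutively so I could see where the problem is, but that is how I discovered that each template has its own issue. I also tried creating the job from the Dataflow monitoring interface, and everything created that way runs perfectly, which means there must be some problem with the script I wrote.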

from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials

def runJob(bucket, template, fileName):
    # Open a connection with the application default credentials
    credentials = GoogleCredentials.get_application_default()
    service = build('dataflow', 'v1b3', credentials=credentials)

    # Name the job after the file being processed
    jobName = fileName.replace('.csv', '')
    projectId = 'my-project'

    # Find the template to run on the dataset
    templatePath = "gs://{bucket}/me@myemail.com/temp/{template}".format(bucket=bucket, template=template)
    # Construct the job JSON; the location parameters are JSON strings
    body = {
        "jobName": jobName,
        "parameters": {
            "inputLocations": "{\"location1\":\"gs://" + bucket + "/me@myemail.com/RawUpload/" + fileName + "\"}",
            "outputLocations": "{\"location1\":\"gs://" + bucket + "/me@myemail.com/CleanData/" + fileName.replace('.csv', '_auto_delete_2') + "\"}",
        },
        "environment": {
            "tempLocation": "gs://{bucket}/me@myemail.com/temp".format(bucket=bucket),
            "zone": "us-central1-f"
        }
    }
    # Create and execute the HTTP request
    request = service.projects().templates().launch(projectId=projectId, gcsPath=templatePath, body=body)
    response = request.execute()
    # Notify the user
    print(response)
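
As a side note, the nested JSON strings in the request body are easy to get subtly wrong when they are concatenated by hand. The following is a minimal sketch, not a confirmed fix for the problem, that builds the same body with `json.dumps`. The helper name `build_launch_body`, its `user_dir` parameter, and the sample bucket and file names are hypothetical; the paths mirror the ones in the script above.

```python
import json


def build_launch_body(bucket, file_name, user_dir="me@myemail.com"):
    """Build a templates.launch request body for one CSV file (hypothetical helper)."""
    base = "gs://{}/{}".format(bucket, user_dir)
    out_name = file_name.replace(".csv", "_auto_delete_2")
    return {
        "jobName": file_name.replace(".csv", ""),
        "parameters": {
            # json.dumps takes care of quoting and escaping the nested JSON string
            "inputLocations": json.dumps(
                {"location1": "{}/RawUpload/{}".format(base, file_name)}
            ),
            "outputLocations": json.dumps(
                {"location1": "{}/CleanData/{}".format(base, out_name)}
            ),
        },
        "environment": {
            "tempLocation": base + "/temp",
            "zone": "us-central1-f",
        },
    }


# Example with made-up bucket and file names
body = build_launch_body("my-bucket", "sales.csv")
print(json.dumps(body, indent=2))
```

Printing the body before launching makes it possible to compare it field by field with the parameters entered in the monitoring interface.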

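The parameters I pass in the JSON body are the same ones I enter when I use the monitoring interface. This tells me that either something happens behind the scenes in the monitoring interface that I am unaware of and therefore have not included, or there is a problem with the code I wrote.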
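
Because the launch call returns as soon as the job is submitted, one way to narrow this down is to poll the job until it reaches a terminal state and then inspect the final job resource. Below is a rough sketch of such a polling loop, assuming the job is fetched with the Dataflow `projects().jobs().get` method. The function `wait_for_job` and its parameters are hypothetical names, and the fetch is passed in as a callable so the loop can be tried without credentials.

```python
import time

# Terminal values of the currentState field on a Dataflow job resource
TERMINAL_STATES = {
    "JOB_STATE_DONE",
    "JOB_STATE_FAILED",
    "JOB_STATE_CANCELLED",
    "JOB_STATE_UPDATED",
    "JOB_STATE_DRAINED",
}


def wait_for_job(get_job, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll get_job() until the job reaches a terminal state.

    get_job is any zero-argument callable returning the job resource as a dict.
    Returns the last job dict seen; raises TimeoutError if the polls run out.
    """
    for _ in range(max_polls):
        job = get_job()
        if job.get("currentState") in TERMINAL_STATES:
            return job
        sleep(poll_seconds)
    raise TimeoutError("job did not reach a terminal state")


# Possible wiring inside runJob, after request.execute() (untested assumption):
# job_id = response["job"]["id"]
# final = wait_for_job(
#     lambda: service.projects().jobs().get(projectId=projectId, jobId=job_id).execute()
# )
# print(final["currentState"])
```

A job that ends in `JOB_STATE_DONE` yet still produces a tiny output would point toward the inputs or parameters rather than a failed run.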

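As mentioned above, the problem varies depending on which template I try to run, but the most common one is an extremely small output file, far smaller than it should be. This is because it contains only the CSV header and what looks like a few random samples from the first row of the data, and it is not even formatted correctly for a CSV file.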
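
To make this symptom checkable from the script rather than by eye, a small sanity check on the downloaded output could count the data rows and flag rows whose column count does not match the header. This is only an illustrative sketch using the standard `csv` module; `summarize_csv` and the sample text are made up for the example.

```python
import csv
import io


def summarize_csv(text):
    """Return (header, data_row_count, malformed_row_numbers) for CSV text.

    Row numbers are 1-based and count the header as row 1.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return [], 0, []
    header, data = rows[0], rows[1:]
    # A row is flagged when its column count differs from the header's
    bad = [n for n, row in enumerate(data, start=2) if len(row) != len(header)]
    return header, len(data), bad


# Made-up sample: the second data row is missing a column
sample = "id,name\n1,alice\n2\n"
header, count, bad = summarize_csv(sample)
print(header, count, bad)
```

Running the same check on the output of a job started from the monitoring interface gives a baseline to compare against.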

Does anyone know what I am missing, or what I am doing wrong?

Job run using the client library does not complete but raises no errors

0 answers: