Executing a PySpark job with dependencies using Azure Data Factory V2

Asked: 2018-05-27 13:49:10

Tags: azure apache-spark pyspark azure-data-factory azure-data-factory-2

I want to execute a PySpark job that has dependencies (an egg or zip file) using Data Factory V2.

When the job is submitted directly on the head node of the HDInsight cluster, the command looks like this (and it works):

spark-submit --py-files 0.3-py3.6.egg main.py 1
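For context, main.py uses code packaged in the egg roughly like the sketch below; the module and function names are purely illustrative and not the real ones:

import sys
from pyspark.sql import SparkSession

# hypothetical module shipped inside 0.3-py3.6.egg; it is importable only
# because the egg is distributed via --py-files (or spark.submit.pyFiles)
from mypackage.dimensions import build_dimension

spark = SparkSession.builder.appName("dimension").getOrCreate()
build_dimension(spark, int(sys.argv[1]))  # the trailing "1" argument from spark-submit
spark.stop()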

In Data Factory (V2) I tried defining the following:

{
    "name": "dimension",
    "properties": {
        "activities": [{
                "name": "Spark1",
                "type": "HDInsightSpark",
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false
                },
                "typeProperties": {
                    "rootPath": "adfspark",
                    "entryFilePath": "main.py",
                    "getDebugInfo": "Always",
                    "sparkConfig": {
                        "spark.submit.pyFiles": "0.3-py3.6.egg"
                    },
                    "sparkJobLinkedService": {
                        "referenceName": "AzureStorageLinkedService",
                        "type": "LinkedServiceReference"
                    }
                },
                "linkedServiceName": {
                    "referenceName": "hdinsightlinkedService",
                    "type": "LinkedServiceReference"
                }
            }
        ]
    }
}

All of this is set up so that "adfspark" is the container and the dependencies sit in a "pyFiles" folder, much like what is suggested in the Azure documentation: https://docs.microsoft.com/en-us/azure/data-factory/tutorial-transform-data-spark-powershell
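For clarity, the blob layout this describes looks roughly like the sketch below (the storage account is the one behind AzureStorageLinkedService; paths are illustrative):

adfspark                 <- blob container referenced by rootPath
    main.py              <- entryFilePath
    pyFiles
        0.3-py3.6.egg    <- the dependency the job needs on the Python path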

Getting the job to run only on the head node would be a sufficient start, although distributed execution is the real goal.

0 answers:

There are no answers yet.