I want to run a PySpark job that has dependencies (an egg or zip file) through Data Factory V2.
When I submit the job directly on the head node of the HDInsight cluster with spark-submit, the command looks like this (and it works):
spark-submit --py-files 0.3-py3.6.egg main.py 1
In Data Factory (V2) I tried defining the following:
{
    "name": "dimension",
    "properties": {
        "activities": [
            {
                "name": "Spark1",
                "type": "HDInsightSpark",
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false
                },
                "typeProperties": {
                    "rootPath": "adfspark",
                    "entryFilePath": "main.py",
                    "getDebugInfo": "Always",
                    "sparkConfig": {
                        "spark.submit.pyFiles": "0.3-py3.6.egg"
                    },
                    "sparkJobLinkedService": {
                        "referenceName": "AzureStorageLinkedService",
                        "type": "LinkedServiceReference"
                    }
                },
                "linkedServiceName": {
                    "referenceName": "hdinsightlinkedService",
                    "type": "LinkedServiceReference"
                }
            }
        ]
    }
}
Here "adfspark" is the blob container and the dependency sits in a "pyFiles" folder, much like the Azure documentation suggests: https://docs.microsoft.com/en-us/azure/data-factory/tutorial-transform-data-spark-powershell
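In case it matters, I uploaded the files to blob storage roughly like this (storage account name and key are placeholders), so the container layout is adfspark/main.py plus adfspark/pyFiles/0.3-py3.6.egg:

# upload the entry script to the root of the "adfspark" container
az storage blob upload --account-name <mystorageaccount> --account-key <mykey> --container-name adfspark --file main.py --name main.py
# upload the egg dependency into the "pyFiles" folder of the same container
az storage blob upload --account-name <mystorageaccount> --account-key <mykey> --container-name adfspark --file 0.3-py3.6.egg --name pyFiles/0.3-py3.6.egg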
Getting the job to run only on the head node would be a sufficient start, although distributed execution is the real goal.