在Azure中创建数据管道

时间:2019-12-12 13:03:09

标签: azure powershell apache-spark azure-data-factory

我有一个类,可以进行一些提取,将负载转换为位于不同JSON文件中的数据集。

此过程正常。但是,我有必要每月手动处理。我在intelliJ中提交了一个spark应用程序(并在转换后提交了Scalla Singleton对象)

因此,我正在尝试使此过程自动化。但是,我找不到文档或教程来知道什么是实现此目标的最佳服务。

过程应:

  1. 创建HDInsight Spark集群
  2. 运行流程(一个Scala类)
  3. 删除之前创建的HDInsight Spark集群

我已经搜索过,但是找到的链接(寻找“按需创建HD见解火花集群”)如下:

我搜索过的其他选项:

谢谢!

1 个答案:

答案 0 :(得分:0)

这是您想要的过程

  1. 创建HDInsight Spark集群

使用Power Shell,应该很容易创建HDInsight集群,这是示例代码:

### Create a Spark 2.3 cluster in Azure HDInsight

# Default cluster size (# of worker nodes), version, and type
$clusterSizeInNodes = "1"
$clusterVersion = "3.6"
$clusterType = "Spark"

# Create the resource group
$resourceGroupName = Read-Host -Prompt "Enter the resource group name"
$location = Read-Host -Prompt "Enter the Azure region to create resources in, such as 'Central US'"
$defaultStorageAccountName = Read-Host -Prompt "Enter the default storage account name"

New-AzResourceGroup -Name $resourceGroupName -Location $location

# Create an Azure storage account and container
# Note: Storage account kind BlobStorage can only be used as secondary storage for HDInsight clusters.
New-AzStorageAccount `
    -ResourceGroupName $resourceGroupName `
    -Name $defaultStorageAccountName `
    -Location $location `
    -SkuName Standard_LRS `
    -Kind StorageV2 `
    -EnableHttpsTrafficOnly 1

$defaultStorageAccountKey = (Get-AzStorageAccountKey `
                                -ResourceGroupName $resourceGroupName `
                                -Name $defaultStorageAccountName)[0].Value

$defaultStorageContext = New-AzStorageContext `
                                -StorageAccountName $defaultStorageAccountName `
                                -StorageAccountKey $defaultStorageAccountKey

# Create a Spark 2.3 cluster
$clusterName = Read-Host -Prompt "Enter the name of the HDInsight cluster"

# Cluster login is used to secure HTTPS services hosted on the cluster
$httpCredential = Get-Credential -Message "Enter Cluster login credentials" -UserName "admin"

# SSH user is used to remotely connect to the cluster using SSH clients
$sshCredentials = Get-Credential -Message "Enter SSH user credentials" -UserName "sshuser"

# Set the storage container name to the cluster name
$defaultBlobContainerName = $clusterName

# Create a blob container. This holds the default data store for the cluster.
New-AzStorageContainer `
    -Name $clusterName `
    -Context $defaultStorageContext

$sparkConfig = New-Object "System.Collections.Generic.Dictionary``2[System.String,System.String]"
$sparkConfig.Add("spark", "2.3")

# Create the HDInsight cluster
New-AzHDInsightCluster `
    -ResourceGroupName $resourceGroupName `
    -ClusterName $clusterName `
    -Location $location `
    -ClusterSizeInNodes $clusterSizeInNodes `
    -ClusterType $clusterType `
    -OSType "Linux" `
    -Version $clusterVersion `
    -ComponentVersion $sparkConfig `
    -HttpCredential $httpCredential `
    -DefaultStorageAccountName "$defaultStorageAccountName.blob.core.windows.net" `
    -DefaultStorageAccountKey $defaultStorageAccountKey `
    -DefaultStorageContainer $clusterName `
    -SshCredential $sshCredentials

Get-AzHDInsightCluster `
    -ResourceGroupName $resourceGroupName `
    -ClusterName $clusterName
  1. 运行流程(一个Scala类)

您可以引用此链接将应用程序作业远程提交到Spark集群:

https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-create-standalone-application#run-the-application-on-the-apache-spark-cluster

  1. 删除之前创建的HDInsight Spark集群

清理集群,可以使用powershell来实现,这是相同的示例代码;

# Removes the specified HDInsight cluster from the current subscription.
Remove-AzHDInsightCluster `
    -ResourceGroupName $resourceGroupName `
    -ClusterName $clusterName

# Removes the specified storage container.
Remove-AzStorageContainer `
    -Name $clusterName `
    -Context $defaultStorageContext

# Removes a Storage account from Azure.
Remove-AzStorageAccount `
    -ResourceGroupName $resourceGroupName `
    -Name $defaultStorageAccountName

# Removes a resource group.
Remove-AzResourceGroup `
    -Name $resourceGroupName

其他参考:

https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-jupyter-spark-sql-use-powershell

https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/data-factory/v1/data-factory-build-your-first-pipeline-using-powershell.md

希望有帮助。