Question

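I have a class that performs some extractions and transforms payloads into datasets that are written to different JSON files.

This process works fine. However, I have to run it manually every month: I submit a Spark application from IntelliJ (and submit the Scala singleton object after the transformations).

So I am trying to automate this process, but I cannot find documentation or a tutorial that tells me which service is the best fit for this goal.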

The process should:

Create an HDInsight Spark cluster
Run the process (a Scala class)
Delete the previously created HDInsight Spark cluster

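I have searched, and the links I found (looking for "create HDInsight Spark cluster on demand") are the following:

Other options I have looked at: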

Host and run your PowerShell scripts in Azure
Azure Logic Apps
Azure Automation

Thanks!

Answer 1

Here is the process you are after:

Create an HDInsight Spark cluster

With PowerShell it should be easy to create an HDInsight cluster. Here is sample code:

### Create a Spark 2.3 cluster in Azure HDInsight

# Default cluster size (# of worker nodes), version, and type
$clusterSizeInNodes = "1"
$clusterVersion = "3.6"
$clusterType = "Spark"

# Create the resource group
$resourceGroupName = Read-Host -Prompt "Enter the resource group name"
$location = Read-Host -Prompt "Enter the Azure region to create resources in, such as 'Central US'"
$defaultStorageAccountName = Read-Host -Prompt "Enter the default storage account name"

New-AzResourceGroup -Name $resourceGroupName -Location $location

# Create an Azure storage account and container
# Note: Storage account kind BlobStorage can only be used as secondary storage for HDInsight clusters.
New-AzStorageAccount `
    -ResourceGroupName $resourceGroupName `
    -Name $defaultStorageAccountName `
    -Location $location `
    -SkuName Standard_LRS `
    -Kind StorageV2 `
    -EnableHttpsTrafficOnly 1

$defaultStorageAccountKey = (Get-AzStorageAccountKey `
                                -ResourceGroupName $resourceGroupName `
                                -Name $defaultStorageAccountName)[0].Value

$defaultStorageContext = New-AzStorageContext `
                                -StorageAccountName $defaultStorageAccountName `
                                -StorageAccountKey $defaultStorageAccountKey

# Create a Spark 2.3 cluster
$clusterName = Read-Host -Prompt "Enter the name of the HDInsight cluster"

# Cluster login is used to secure HTTPS services hosted on the cluster
$httpCredential = Get-Credential -Message "Enter Cluster login credentials" -UserName "admin"

# SSH user is used to remotely connect to the cluster using SSH clients
$sshCredentials = Get-Credential -Message "Enter SSH user credentials" -UserName "sshuser"

# Set the storage container name to the cluster name
$defaultBlobContainerName = $clusterName

# Create a blob container. This holds the default data store for the cluster.
New-AzStorageContainer `
    -Name $clusterName `
    -Context $defaultStorageContext

$sparkConfig = New-Object "System.Collections.Generic.Dictionary``2[System.String,System.String]"
$sparkConfig.Add("spark", "2.3")

# Create the HDInsight cluster
New-AzHDInsightCluster `
    -ResourceGroupName $resourceGroupName `
    -ClusterName $clusterName `
    -Location $location `
    -ClusterSizeInNodes $clusterSizeInNodes `
    -ClusterType $clusterType `
    -OSType "Linux" `
    -Version $clusterVersion `
    -ComponentVersion $sparkConfig `
    -HttpCredential $httpCredential `
    -DefaultStorageAccountName "$defaultStorageAccountName.blob.core.windows.net" `
    -DefaultStorageAccountKey $defaultStorageAccountKey `
    -DefaultStorageContainer $clusterName `
    -SshCredential $sshCredentials

Get-AzHDInsightCluster `
    -ResourceGroupName $resourceGroupName `
    -ClusterName $clusterName
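
The sample above prompts with Read-Host and Get-Credential, which blocks an unattended monthly run. As a minimal sketch (not part of the original answer), the credentials could instead be built from values passed in by whatever schedules the script. The function name `New-ClusterCredential` and the environment variable in the usage comment are hypothetical.

```powershell
# Sketch: build a PSCredential without an interactive prompt.
# New-ClusterCredential is a hypothetical helper, not an Az cmdlet.
function New-ClusterCredential {
    param(
        [Parameter(Mandatory)] [string] $UserName,
        [Parameter(Mandatory)] [securestring] $Password
    )
    # PSCredential accepts a user name and a SecureString password.
    return New-Object System.Management.Automation.PSCredential ($UserName, $Password)
}

# Hypothetical usage: the scheduler supplies the password through an environment variable.
# $securePassword = ConvertTo-SecureString $env:HDI_CLUSTER_PASSWORD -AsPlainText -Force
# $httpCredential = New-ClusterCredential -UserName "admin" -Password $securePassword
```

The resource group, location, storage account, and cluster names can be supplied the same way, as script parameters rather than prompts.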

Run the process (a Scala class)

You can refer to this link to submit the application job remotely to the Spark cluster:

https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-create-standalone-application#run-the-application-on-the-apache-spark-cluster
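
The linked article submits the job from IntelliJ. For a scripted run, one alternative is the Livy REST endpoint that HDInsight Spark clusters expose under /livy. The following is a rough sketch under stated assumptions, not the method from the link: it assumes the application JAR has already been uploaded to the cluster's default storage, and the function names, JAR path, and class name are hypothetical placeholders.

```powershell
# Sketch: submit a Scala Spark application as a Livy batch and wait for the result.
# New-LivyBatchBody and Invoke-LivyBatch are hypothetical helper names.

function New-LivyBatchBody {
    param(
        [Parameter(Mandatory)] [string] $JarPath,    # e.g. "wasbs:///example/jars/my-app.jar" (placeholder)
        [Parameter(Mandatory)] [string] $ClassName   # e.g. "com.example.MyJob" (placeholder)
    )
    # Livy's POST /batches takes "file" (the JAR) and "className" (the main class).
    return @{ file = $JarPath; className = $ClassName } | ConvertTo-Json
}

function Invoke-LivyBatch {
    param(
        [Parameter(Mandatory)] [string] $ClusterName,
        [Parameter(Mandatory)] [pscredential] $HttpCredential,   # the cluster login, not the SSH user
        [Parameter(Mandatory)] [string] $JarPath,
        [Parameter(Mandatory)] [string] $ClassName,
        [int] $PollSeconds = 30
    )
    $baseUri = "https://${ClusterName}.azurehdinsight.net/livy/batches"
    $headers = @{ "X-Requested-By" = $HttpCredential.UserName }

    $batch = Invoke-RestMethod -Method Post -Uri $baseUri `
        -Credential $HttpCredential -Headers $headers `
        -ContentType "application/json" `
        -Body (New-LivyBatchBody -JarPath $JarPath -ClassName $ClassName)

    # Poll until Livy reports a terminal state for the batch.
    $state = $batch.state
    while ($state -notin @("success", "dead", "killed", "error")) {
        Start-Sleep -Seconds $PollSeconds
        $state = (Invoke-RestMethod -Method Get -Uri "$baseUri/$($batch.id)" `
            -Credential $HttpCredential -Headers $headers).state
    }
    if ($state -ne "success") {
        throw "Livy batch $($batch.id) ended in state '$state'."
    }
    return $batch.id
}
```

Called with the `$clusterName` and `$httpCredential` from the creation script, `Invoke-LivyBatch` returns only after the job has finished, so the cleanup step does not delete the cluster while the job is still running.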

Delete the previously created HDInsight Spark cluster

Cleaning up the cluster can also be done with PowerShell. Here is sample code:

# Removes the specified HDInsight cluster from the current subscription.
Remove-AzHDInsightCluster `
    -ResourceGroupName $resourceGroupName `
    -ClusterName $clusterName

# Removes the specified storage container.
Remove-AzStorageContainer `
    -Name $clusterName `
    -Context $defaultStorageContext

# Removes a Storage account from Azure.
Remove-AzStorageAccount `
    -ResourceGroupName $resourceGroupName `
    -Name $defaultStorageAccountName

# Removes a resource group.
Remove-AzResourceGroup `
    -Name $resourceGroupName
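
Since the goal is an unattended create, run, delete cycle, it may help to guarantee that the delete step runs even when the job fails, so the cluster does not keep running. A minimal sketch using try/finally follows; `Invoke-WithCleanup` is a hypothetical helper name, and the three script blocks stand in for the create, submit, and remove code from this answer.

```powershell
# Sketch: run create -> job -> delete so that cleanup always executes.
# Invoke-WithCleanup is a hypothetical helper, not an Az cmdlet.
function Invoke-WithCleanup {
    param(
        [Parameter(Mandatory)] [scriptblock] $Create,
        [Parameter(Mandatory)] [scriptblock] $Run,
        [Parameter(Mandatory)] [scriptblock] $Cleanup
    )
    try {
        & $Create   # e.g. the New-AzHDInsightCluster call
        & $Run      # e.g. the job submission
    }
    finally {
        # Runs whether creation or the job succeeded or threw.
        & $Cleanup  # e.g. the Remove-AzHDInsightCluster call
    }
}
```

Because the cleanup block also runs when creation fails partway, it should tolerate resources that do not exist yet. In a scheduled run, `Remove-AzResourceGroup` also needs `-Force` to skip its confirmation prompt.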

Other references:

https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-jupyter-spark-sql-use-powershell

https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/data-factory/v1/data-factory-build-your-first-pipeline-using-powershell.md

Hope this helps.
