I have an Azure Data Factory pipeline that needs to extract all CSV files from a Blob Storage container and store them in an Azure Data Lake container. Before storing the files in the data lake, I need to perform some data manipulation on their contents.
Now, I need this process to run sequentially rather than in parallel, so I set ForEach Activity -> Settings -> Sequential.
However, it does not run sequentially; it still runs as a parallel process.
{
    "name": "PN_obfuscate_and_move",
    "properties": {
        "description": "move PN blob csv to adlgen2(obfuscated)",
        "activities": [
            {
                "name": "GetBlobFileName",
                "type": "GetMetadata",
                "dependsOn": [],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "dataset": {
                        "referenceName": "PN_Getblobfilename_Dataset",
                        "type": "DatasetReference"
                    },
                    "fieldList": [
                        "childItems"
                    ],
                    "storeSettings": {
                        "type": "AzureBlobStorageReadSetting",
                        "recursive": true
                    },
                    "formatSettings": {
                        "type": "DelimitedTextReadSetting"
                    }
                }
            },
            {
                "name": "ForEachBlobFile",
                "type": "ForEach",
                "dependsOn": [
                    {
                        "activity": "GetBlobFileName",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "userProperties": [],
                "typeProperties": {
                    "items": {
                        "value": "@activity('GetBlobFileName').output.childItems",
                        "type": "Expression"
                    },
                    "isSequential": true,
                    "activities": [
                        {
                            "name": "Blob_to_SQLServer",
                            "description": "Copy PN blob files to sql server table",
                            "type": "Copy",
                            "dependsOn": [],
                            "policy": {
                                "timeout": "7.00:00:00",
                                "retry": 0,
                                "retryIntervalInSeconds": 30,
                                "secureOutput": false,
                                "secureInput": false
                            },
                            "userProperties": [
                                {
                                    "name": "Source",
                                    "value": "PNemailattachment//"
                                },
                                {
                                    "name": "Destination",
                                    "value": "[dbo].[PN]"
                                }
                            ],
                            "typeProperties": {
                                "source": {
                                    "type": "DelimitedTextSource",
                                    "storeSettings": {
                                        "type": "AzureBlobStorageReadSetting",
                                        "recursive": false,
                                        "wildcardFileName": "*.*",
                                        "enablePartitionDiscovery": false
                                    },
                                    "formatSettings": {
                                        "type": "DelimitedTextReadSetting"
                                    }
                                },
                                "sink": {
                                    "type": "AzureSqlSink"
                                },
                                "enableStaging": false
                            },
                            "inputs": [
                                {
                                    "referenceName": "PNBlob",
                                    "type": "DatasetReference"
                                }
                            ],
                            "outputs": [
                                {
                                    "referenceName": "PN_SQLServer",
                                    "type": "DatasetReference"
                                }
                            ]
                        },
                        {
                            "name": "Obfuscate_PN_SQLData",
                            "description": "mask specific columns",
                            "type": "SqlServerStoredProcedure",
                            "dependsOn": [
                                {
                                    "activity": "Blob_to_SQLServer",
                                    "dependencyConditions": [
                                        "Succeeded"
                                    ]
                                }
                            ],
                            "policy": {
                                "timeout": "7.00:00:00",
                                "retry": 0,
                                "retryIntervalInSeconds": 30,
                                "secureOutput": false,
                                "secureInput": false
                            },
                            "userProperties": [],
                            "typeProperties": {
                                "storedProcedureName": "[dbo].[Obfuscate_PN_Data]"
                            },
                            "linkedServiceName": {
                                "referenceName": "PN_SQLServer",
                                "type": "LinkedServiceReference"
                            }
                        },
                        {
                            "name": "SQLServer_to_ADLSGen2",
                            "description": "move PN obfuscated data to azure data lake gen2",
                            "type": "Copy",
                            "dependsOn": [
                                {
                                    "activity": "Obfuscate_PN_SQLData",
                                    "dependencyConditions": [
                                        "Succeeded"
                                    ]
                                }
                            ],
                            "policy": {
                                "timeout": "7.00:00:00",
                                "retry": 0,
                                "retryIntervalInSeconds": 30,
                                "secureOutput": false,
                                "secureInput": false
                            },
                            "userProperties": [],
                            "typeProperties": {
                                "source": {
                                    "type": "AzureSqlSource"
                                },
                                "sink": {
                                    "type": "DelimitedTextSink",
                                    "storeSettings": {
                                        "type": "AzureBlobFSWriteSetting"
                                    },
                                    "formatSettings": {
                                        "type": "DelimitedTextWriteSetting",
                                        "quoteAllText": true,
                                        "fileExtension": ".csv"
                                    }
                                },
                                "enableStaging": false
                            },
                            "inputs": [
                                {
                                    "referenceName": "PN_SQLServer",
                                    "type": "DatasetReference"
                                }
                            ],
                            "outputs": [
                                {
                                    "referenceName": "PNADLSGen2",
                                    "type": "DatasetReference"
                                }
                            ]
                        },
                        {
                            "name": "Delete_PN_SQLData",
                            "description": "delete all data from table",
                            "type": "SqlServerStoredProcedure",
                            "dependsOn": [
                                {
                                    "activity": "SQLServer_to_ADLSGen2",
                                    "dependencyConditions": [
                                        "Succeeded"
                                    ]
                                }
                            ],
                            "policy": {
                                "timeout": "7.00:00:00",
                                "retry": 0,
                                "retryIntervalInSeconds": 30,
                                "secureOutput": false,
                                "secureInput": false
                            },
                            "userProperties": [],
                            "typeProperties": {
                                "storedProcedureName": "[dbo].[Delete_PN_Data]"
                            },
                            "linkedServiceName": {
                                "referenceName": "PN_SQLServer",
                                "type": "LinkedServiceReference"
                            }
                        }
                    ]
                }
            }
        ],
        "folder": {
            "name": "PN"
        },
        "annotations": []
    },
    "type": "Microsoft.DataFactory/factories/pipelines"
}
Answer (score: 1)
The ForEach activity in Azure Data Factory (ADF) runs up to 20 iterations in parallel by default, and you can raise that to a maximum of 50 via the batchCount property. If you want to force it to run sequentially, i.e. one iteration after another, tick the "Sequential" checkbox in the Settings section of the ForEach activity, or set the isSequential property of the ForEach activity to true in the JSON, e.g.
{
    "name": "<MyForEachPipeline>",
    "properties": {
        "activities": [
            {
                "name": "<MyForEachActivity>",
                "type": "ForEach",
                "typeProperties": {
                    "isSequential": true,
                    "items": {
                        ...
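Conversely, if you want to keep parallelism but bound it, leave isSequential off and set the batchCount property instead; it caps the number of concurrent iterations. A sketch (the value 10 is just an illustration; the placeholder names match the snippet above):

```json
{
    "name": "<MyForEachActivity>",
    "type": "ForEach",
    "typeProperties": {
        "batchCount": 10,
        "items": {
            "value": "@activity('GetBlobFileName').output.childItems",
            "type": "Expression"
        }
    }
}
```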
I would caution you about this setting, though. Running iterations serially (one after another) slows the pipeline down. Could you design your workflow another way, so that it takes advantage of this genuinely powerful feature of Azure Data Factory? Run in parallel, your job takes only as long as the longest task, rather than the sum of all tasks.
Say I have a job with 10 tasks, each taking 1 second. Run serially, the job takes 10 seconds; run in parallel, it takes about 1 second.
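The arithmetic above can be illustrated with a small Python simulation (this is not ADF itself, just 10 one-second tasks timed serially and then with a worker pool):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def task(n):
    """Simulate one pipeline iteration taking ~1 second."""
    time.sleep(1)
    return n

items = list(range(10))

# Serial: total time is the sum of all task durations (~10 s).
start = time.monotonic()
for i in items:
    task(i)
serial = time.monotonic() - start

# Parallel: total time approaches the longest single task (~1 s).
start = time.monotonic()
with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(task, items))
parallel = time.monotonic() - start

print(f"serial: {serial:.1f}s, parallel: {parallel:.1f}s")
```

Because the simulated tasks only sleep, threads are enough to show the effect; in ADF the same saving comes from concurrent activity runs.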
SSIS never really had this built in; you could manually create multiple paths, or use third-party components, but it wasn't native. It really is a great feature of ADF, and you should try to take advantage of it. Of course, sometimes you genuinely do need to run serially, which is why the option is there.