How to get Azure Data Factory to loop through files in a folder

Time: 2019-08-16 21:41:03

Tags: azure azure-data-factory

I am looking at the link below.

https://azure.microsoft.com/en-us/updates/data-factory-supports-wildcard-file-filter-for-copy-activity/

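We should be able to use wildcards in both the folder path and the file name. If you click the activity and then the Source tab, you see this view.

I want to loop through the days of several months, so the settings should look something like this view.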


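Of course, this does not actually work. I get the following error: ErrorCode: 'PathNotFound'. Message: 'The specified path does not exist.' Given a specific string pattern in the file path and file name, how can I get the tool to recursively iterate through all the files in all the folders? Thanks.

1 answer:

Answer 0: (score: 1)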

  

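"I want to loop through the days of several months"

  • For this, you can pass two parameters from the pipeline to the activity, so that the path can be built dynamically from those parameters. ADF V2 allows you to pass parameters.

Let's go through it step by step:

1. Create a pipeline and give it two parameters, one for the month and one for the day.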

Note: If needed, these parameters can also be passed in from the output of other activities. Reference: Parameters in ADF
Passing Params to the copy activity through pipeline
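
Because the goal is to walk through several months day by day, one option is to start one pipeline run per date, each with its own month and day values. The Python sketch below only generates those parameter pairs. The function name month_day_parameters, the sample date range, and the zero-padded format are assumptions for illustration; how the runs are actually started (trigger, SDK, REST call) is left out.

```python
from datetime import date, timedelta


def month_day_parameters(start: date, end: date):
    """Yield one {"month", "day"} parameter set per calendar day, inclusive."""
    current = start
    while current <= end:
        # Zero-padded strings are an assumption; use whatever your folder names use.
        yield {"month": f"{current.month:02d}", "day": f"{current.day:02d}"}
        current += timedelta(days=1)


if __name__ == "__main__":
    # Hypothetical range that crosses a month boundary.
    for params in month_day_parameters(date(2019, 7, 30), date(2019, 8, 2)):
        print(params)
```

Each dictionary has the same keys as the pipeline parameters (month and day), so it could be handed to a pipeline run unchanged. The year is not part of the parameters, because the path expression takes it from utcnow().
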
2. Create two datasets.

2.1 Sink dataset: Blob storage here. Link it to your linked service and provide the container name (make sure it exists). Again, this can be passed as a parameter if needed.

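2.2 Source dataset: again Blob storage here, or whatever store you need. Link it to your linked service and provide the container name (make sure it exists). Again, this can be passed as a parameter if needed.
Note:
1. The folder path determines where the data is copied to. If the container does not exist, the activity creates it for you, and if the file already exists, it is overwritten by default.

2. If you want to build the output path dynamically, pass parameters to the dataset. Here I created two dataset parameters named monthcopy and datacopy.

3. Create a copy activity in the pipeline.

Wildcard folder path: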

    @{concat(formatDateTime(adddays(utcnow(),-1),'yyyy'),'/',string(pipeline().parameters.month),'/',string(pipeline().parameters.day),'/*')}

where:
    The path resolves to: <year of yesterday, UTC>/<month passed>/<day passed>/* (the trailing * matches any folder one level down)
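
To see what that expression evaluates to, here is a minimal Python sketch that mirrors it outside of ADF. It is not ADF code: the function name wildcard_folder_path and the sample values "08" and "15" are assumptions for illustration, and the optional now argument exists only so the result can be reproduced.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def wildcard_folder_path(month: str, day: str, now: Optional[datetime] = None) -> str:
    """Mirror concat(formatDateTime(adddays(utcnow(),-1),'yyyy'),'/',month,'/',day,'/*')."""
    now = now or datetime.now(timezone.utc)
    # adddays(utcnow(), -1) followed by formatDateTime(..., 'yyyy'):
    # the four-digit year of "yesterday" in UTC.
    year = (now - timedelta(days=1)).strftime("%Y")
    return f"{year}/{month}/{day}/*"


if __name__ == "__main__":
    fixed_now = datetime(2019, 8, 16, 21, 41, tzinfo=timezone.utc)
    print(wildcard_folder_path("08", "15", fixed_now))  # 2019/08/15/*
```

Because the year is taken from the day before the run, a run on 1 January still points at the previous year's folder.
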



JSON template for the pipeline:

{
    "name": "pipeline2",
    "properties": {
        "activities": [
            {
                "name": "Copy Data1",
                "type": "Copy",
                "dependsOn": [],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "source": {
                        "type": "DelimitedTextSource",
                        "storeSettings": {
                            "type": "AzureBlobStorageReadSettings",
                            "recursive": true,
                            "wildcardFolderPath": {
                                "value": "@{concat(formatDateTime(adddays(utcnow(),-1),'yyyy'),'/',string(pipeline().parameters.month),'/',string(pipeline().parameters.day),'/*')}",
                                "type": "Expression"
                            },
                            "wildcardFileName": "*.csv",
                            "enablePartitionDiscovery": false
                        },
                        "formatSettings": {
                            "type": "DelimitedTextReadSettings"
                        }
                    },
                    "sink": {
                        "type": "DelimitedTextSink",
                        "storeSettings": {
                            "type": "AzureBlobStorageWriteSettings"
                        },
                        "formatSettings": {
                            "type": "DelimitedTextWriteSettings",
                            "quoteAllText": true,
                            "fileExtension": ".csv"
                        }
                    },
                    "enableStaging": false
                },
                "inputs": [
                    {
                        "referenceName": "DelimitedText1",
                        "type": "DatasetReference"
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "DelimitedText2",
                        "type": "DatasetReference",
                        "parameters": {
                            "monthcopy": {
                                "value": "@pipeline().parameters.month",
                                "type": "Expression"
                            },
                            "datacopy": {
                                "value": "@pipeline().parameters.day",
                                "type": "Expression"
                            }
                        }
                    }
                ]
            }
        ],
        "parameters": {
            "month": {
                "type": "string"
            },
            "day": {
                "type": "string"
            }
        },
        "annotations": []
    }
}
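
As a rough illustration of which blobs the wildcardFolderPath and wildcardFileName settings in this template are meant to pick up, the Python sketch below matches a few made-up blob paths against both patterns. It is only an approximation of the behaviour described above (the * standing for one folder level): it is not how ADF is implemented, it ignores the recursive setting, and the helper name is_selected and the sample paths are hypothetical.

```python
from fnmatch import fnmatchcase


def is_selected(blob_path: str, folder_pattern: str, file_pattern: str) -> bool:
    """Match folder segments one by one, then the file name."""
    *folders, file_name = blob_path.split("/")
    patterns = folder_pattern.split("/")
    if len(folders) != len(patterns):
        return False  # in this sketch '*' stands for exactly one folder level
    folders_ok = all(fnmatchcase(f, p) for f, p in zip(folders, patterns))
    return folders_ok and fnmatchcase(file_name, file_pattern)


if __name__ == "__main__":
    blobs = [
        "2019/08/15/sales/a.csv",      # matches folder and file patterns
        "2019/08/15/sales/notes.txt",  # wrong extension
        "2019/08/15/b.csv",            # no folder where the '*' is expected
        "2019/08/16/sales/c.csv",      # different day
    ]
    for blob in blobs:
        print(blob, is_selected(blob, "2019/08/15/*", "*.csv"))
```
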

JSON template for the source dataset (DelimitedText1):

{
    "name": "DelimitedText1",
    "properties": {
        "linkedServiceName": {
            "referenceName": "AzureBlobStorage1",
            "type": "LinkedServiceReference"
        },
        "annotations": [],
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "corpdata"
            },
            "columnDelimiter": ",",
            "escapeChar": "\\",
            "quoteChar": "\""
        },
        "schema": []
    }
}

JSON template for the sink dataset (DelimitedText2):

{
    "name": "DelimitedText2",
    "properties": {
        "linkedServiceName": {
            "referenceName": "AzureBlobStorage1",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "monthcopy": {
                "type": "string"
            },
            "datacopy": {
                "type": "string"
            }
        },
        "annotations": [],
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "folderPath": {
                    "value": "@concat(formatDateTime(adddays(utcnow(),-1),'yyyy'),dataset().monthcopy,'/',dataset().datacopy)",
                    "type": "Expression"
                },
                "container": "copycorpdata"
            },
            "columnDelimiter": ",",
            "escapeChar": "\\",
            "quoteChar": "\""
        },
        "schema": []
    }
}
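
The folderPath expression in this dataset can be mirrored the same way to check where the copied files would land. This is a hedged Python sketch, not ADF code; the function name sink_folder_path and the sample values are assumptions, and the optional now argument is there only to make the result reproducible.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def sink_folder_path(monthcopy: str, datacopy: str, now: Optional[datetime] = None) -> str:
    """Mirror concat(formatDateTime(adddays(utcnow(),-1),'yyyy'), monthcopy, '/', datacopy)."""
    now = now or datetime.now(timezone.utc)
    year = (now - timedelta(days=1)).strftime("%Y")  # year of "yesterday" in UTC
    # As in the dataset expression, there is no '/' between the year and monthcopy.
    return f"{year}{monthcopy}/{datacopy}"


if __name__ == "__main__":
    fixed_now = datetime(2019, 8, 16, 21, 41, tzinfo=timezone.utc)
    print(sink_folder_path("08", "15", fixed_now))  # 201908/15
```

As written, the expression joins the year and the month without a separator. If the sink should use the same yyyy/month/day layout as the source path, a '/' would presumably need to be added after the year part of the expression.
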