Azure Data Factory truncating files in copy pipeline from S3 bucket to ADLS

Asked: 2018-07-16 13:26:40

Tags: azure amazon-s3 azure-data-factory azure-data-lake amazon-data-pipeline

I have a copy pipeline that uses Data Factory to copy a handful of files from daily folders in an S3 bucket into a data lake in Azure, and I am running into a very strange problem.

Suppose there are three files in the S3 bucket: one is 30MB, another is 50MB, and the last is 70MB. If I put the 30MB file "first" (naming it test0.tsv), Data Factory claims it copied all three files to ADLS successfully. However, the second and third files are truncated to 30MB. The data in each file is correct as far as it goes, but the files are cut off. If I put the 70MB file first instead, all three are copied correctly. So it is using the first file's length as a maximum file size and truncating any subsequent, longer file. What also worries me is that Azure Data Factory claims it copied them all successfully.
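
One way to confirm the truncation independently of the Data Factory run logs is to compare object sizes on both sides. Below is a minimal sketch of that check in Python, assuming the boto3 and azure-datalake-store packages; the bucket, store name, folders, and service-principal credentials are placeholders:

import boto3
from azure.datalake.store import core, lib

BUCKET = "my-bucket"              # placeholder
PREFIX = "2018/07/16/"            # daily folder for the slice
STORE = "mydatalakestore"         # placeholder ADLS account name
ADLS_DIR = "redacted/2018/07/16"  # placeholder sink folder

# List the S3 objects and their sizes for the daily folder.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
s3_sizes = {obj["Key"].split("/")[-1]: obj["Size"]
            for obj in resp.get("Contents", [])}

# List the copied files in ADLS with their lengths.
token = lib.auth(tenant_id="...", client_id="...", client_secret="...")
adls = core.AzureDLFileSystem(token, store_name=STORE)
adls_sizes = {entry["name"].split("/")[-1]: entry["length"]
              for entry in adls.ls(ADLS_DIR, detail=True)}

# Report any file whose size changed in transit.
for name, size in s3_sizes.items():
    copied = adls_sizes.get(name)
    if copied is not None and copied != size:
        print(f"{name}: S3 {size} bytes -> ADLS {copied} bytes (truncated)")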

Here is my pipeline:

{
    "name": "[redacted]Pipeline",
    "properties": {
        "description": "[redacted]",
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "FileSystemSource",
                        "recursive": true
                    },
                    "sink": {
                        "type": "AzureDataLakeStoreSink",
                        "copyBehavior": "PreserveHierarchy",
                        "writeBatchSize": 0,
                        "writeBatchTimeout": "00:00:00"
                    }
                },
                "inputs": [
                    {
                        "name": "InputDataset"
                    }
                ],
                "outputs": [
                    {
                        "name": "OutputDataset"
                    }
                ],
                "policy": {
                    "retry": 3
                },
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1
                },
                "name": "[redacted]"
            }
        ],
        "start": "2018-07-06T04:00:00Z",
        "end": "2018-07-30T04:00:00Z",
        "isPaused": false,
        "hubName": "[redacted]",
        "pipelineMode": "Scheduled"
    }
}
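
For reference on the sink setting: "copyBehavior": "PreserveHierarchy" keeps each file's path relative to the source folder, so a file picked up under the daily key prefix should land under the matching dated folder in the lake, for example (placeholder names):

S3:   s3://<bucket>/2018/07/16/test0.tsv
ADLS: [redacted]/2018/07/16/test0.tsv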

Here is my input dataset:

{
    "name": "InputDataset",
    "properties": {
        "published": false,
        "type": "AmazonS3",
        "linkedServiceName": "[redacted]",
        "typeProperties": {
            "bucketName": "[redacted]",
            "key": "$$Text.Format('{0:yyyy}/{0:MM}/{0:dd}/', SliceStart)"
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        },
        "external": true,
        "policy": {}
    }
}
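
The key expression just formats the slice date into the daily prefix; for the 2018-07-16 slice, for example, it resolves to:

"key": "2018/07/16/"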

Here is my output dataset:

{
    "name": "OutputDataset",
    "properties": {
        "published": false,
        "type": "AzureDataLakeStore",
        "linkedServiceName": "[redacted]",
        "typeProperties": {
            "folderPath": "[redacted]/{Year}/{Month}/{Day}",
            "partitionedBy": [
                {
                    "name": "Year",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "yyyy"
                    }
                },
                {
                    "name": "Month",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "MM"
                    }
                },
                {
                    "name": "Day",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "dd"
                    }
                }
            ]
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}
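
Correspondingly, the partitionedBy variables fill in folderPath, so the same slice writes to the matching dated folder:

"folderPath": "[redacted]/2018/07/16"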

I have already removed the format field from the input and output datasets, because I thought making this a binary copy might fix the problem, but that didn't help.
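
For reference, the removed section was a TextFormat block of the following sort (a sketch; the tab delimiter shown is an assumption based on the .tsv files, not the exact redacted settings):

"format": {
    "type": "TextFormat",
    "columnDelimiter": "\t"
}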

0 Answers:

No answers yet.